Preface


The general presentation of Volume 2 continues those aims which I set out in the 
preface to Volume 1 and I shall not repeat the comments here. The additional 
complication of constraints in an optimization problem does make the general level 
of the material more difficult and I have made some effort to present at least the 
basic ideas in a straightforward way. In some cases this has led me to first of all 
present some results in an intuitive way followed by a rigorous but more mathemat- 
ical derivation. The chapters in Volume 2 number on from those in Volume 1 so 
that cross referencing between the volumes is unambiguous. The selection of topics 
for this volume also has required some thought. Most finite dimensional problems 
of a continuous nature have been included but I have generally kept away from 
problems of a discrete or combinatorial nature since they have an entirely different 
character and the choice of method can be very specialized. In this case the nearest 
thing to a general purpose method is the branch and bound method, and since this 
is a transformation to a sequence of continuous problems of the type covered in 
this volume, I have included a straightforward description of the technique. 

A feature of this volume which I think is lacking in the literature is a treatment of 
non-differentiable optimization which is reasonably comprehensive and covers both 
theoretical and practical aspects adequately. I hope that the final chapter meets this 
need. Although only unconstrained problems of this type are discussed it is appro- 
priate to do it in this volume since a considerable background of constrained 
optimization theory is required. The subject of geometric programming is also 
included in the book because I think that it is potentially valuable, and again I hope 
that this treatment will turn out to be more straightforward and appealing than 
others in the literature. The subject of nonlinear programming is covered in some 
detail but there are difficulties in that this is a very active research area. To some 
extent therefore the presentation mirrors my assessment and prejudice as to how 
things will turn out, in the absence of a generally agreed point of view. However
I have also tried to present various alternative approaches and their merits and
demerits. Linear constraint programming, on the other hand, is now well developed 
and here the difficulty is that there are two distinct points of view. One is the 
traditional approach in which algorithms are presented as generalizations of early
linear programming methods which carry out pivoting in a tableau. The other is a 
more recent approach in terms of active set strategies: I regard this as more intuitive 
and flexible and have therefore emphasized it, although both methods are presented
and their relationship is explored. 

The occurrence of errors in the text is vexatious for any author and my experience
with Volume 1 and the greater complexity of Volume 2 assures me that a fair
number of these will inevitably be present. I would be most grateful therefore to 
know of any serious errors and also to hear of any other observations which may be 
pertinent. A list of these for Volume 1 follows this preface. The fact that Volume 1 
is relatively free from typographical errors is entirely due to my wife Mary who 
proofread the entire volume despite not understanding very much of it. Perhaps this 
was an advantage! She also prepared the subject index and I am very grateful for her 
assistance. I must also repeat my thanks to those people mentioned in Volume 1 
who have helped me in many ways with the preparation of the book. In addition 
I am very grateful to R. S. Womersley for his constructive advice on Chapter 14. 

I would also like to thank Mrs V. Gandy, who typed much of the book, and
Mrs C. Peters for their excellent typing. The preparation of this book has occupied
me to a greater or lesser extent for the past seven years. The final impetus to com- 
plete Volume 2 came during a period of leave of absence at the Mathematics Depart- 
ment, University of Kentucky, Lexington, at the invitation of Professor R. Wets, 
and I am very grateful for this support and for the provision of typing and computing 
facilities. 


Dundee, October 1980 R. Fletcher 


Errata to Volume 1 


p. 21, line 2 from bottom: .. . it follows that (f*) — fATDy_gT§) = 1, 
contradicting (2.4.3). Hence g) +0. O 


p. 22, equation (2.4.8): replace > by <. 
p. 27, equation (2.6.1): delete all subscript 2s. 
p. 35, line 20: ... |h || >0. Thus... 


pp. 51 and 52, theorem 3.4.2: delete the word ‘least’ in the statement of the 
theorem and corollary. 


p. 64, equation (4.1.9): insert — after first. 
p. 78: run on line 16 after line 15 in (iv). 


p. 83: It has been pointed out to me by Dr D. Sorenson that $G^{(k)} + \nu I$ positive
semi-definite (together with (5.2.1) and feasibility) is also a sufficient (as well
as necessary) condition for a global solution (not necessarily unique) to
(5.1.2) using $\|\cdot\|_2$. This follows because

$$\tfrac{1}{2}(\delta - \delta^{(k)})^T (G^{(k)} + \nu I)(\delta - \delta^{(k)}) \ge 0$$

implies using (5.2.1) and (3.1.1) that

$$q^{(k)}(\delta) - q^{(k)}(\delta^{(k)}) \ge \tfrac{1}{2}\nu\,(\delta^{(k)T}\delta^{(k)} - \delta^T\delta) \ge 0$$

by the first order conditions. I am very grateful for this observation which
nicely completes the theoretical properties of the system.


p. 107, line 4 from bottom: J in bold type not italics. 
Chapter 7


Introduction 


7.1 Preview 


The motivation for studying constrained minimization has been discussed at some 
length in Chapter 1 of Volume 1 of this book. The mathematical background given 
there, and indeed many of the concepts which arise in unconstrained optimization, 
are important in the study of constrained optimization. In this volume, the selection 
of material from the extensive literature which exists has again been done with the 
main theme of practicality in mind. Thus topics such as reliability and effectiveness 
are uppermost; to some extent these are measured by convergence and rate of 
convergence results. These aspects are therefore studied in some detail, and together 
with the subject of optimality conditions they provide good material for an 
academic course. However the use of experimentation to validate the properties of 
an algorithm is still of paramount importance. In fact the study of constrained 
optimization is by no means as well advanced as for the unconstrained case. The 
writing of software is a much more complex task, and so comparative experimental 
results are much less widely available. Often there is even a lack of suitable test 
problems. Also many more special cases arise and the problem of assessing 
numerical evidence is more difficult. For all these reasons Volume 2 departs from 
the feature in Volume 1 of presenting detailed numerical evidence. Nonetheless 
important experimental results do exist in the literature and the selection of 
material is guided by such results. This lack of certainty also shows up in that the 
decision as to precisely what algorithm to recommend in any one case is often not
clear. For this reason the availability of good well-documented library software is 
often poor. Thus I appreciate the fact that many algorithms are necessarily used 
which are not ideal and I have tried to make users aware of defects in these algor- 
ithms and to enable them to mitigate their worst effects. 

The structure of most constrained optimization problems is essentially contained 
in the following: 


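$$
\begin{aligned}
\text{minimize}\quad & f(x), && x \in \mathbb{R}^n \\
\text{subject to}\quad & c_i(x) = 0, && i \in E \\
& c_i(x) \ge 0, && i \in I.
\end{aligned}
\tag{7.1.1}
$$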


As in Volume 1, $f(x)$ is the objective function, but there are additional constraint
functions $c_i(x)$, $i = 1, 2, \ldots, p$. $E$ is the index set of equations or equality
constraints in the problem, $I$ is the set of inequality constraints, and both these
sets are finite. More general constraints can usually be put into this form: for example
$c_i(x) \le b$ becomes $b - c_i(x) \ge 0$. If any point $x'$ satisfies all the constraints in
(7.1.1) it is said to be a feasible point and the set of all such points is referred to as
the feasible region $R$. As in Volume 1, maximization problems are easily handled by
the transformation $\max f(x) = -\min\, -f(x)$. Also, a local minimizer or solution
(referred to by x*) is looked for, rather than a global minimizer, the computation 
of which can be difficult. It is possible to illustrate the effect of the constraints 
when n = 2 by drawing the zero contour of each constraint function. For an equality 
constraint, the line itself is the set of feasible points; for an inequality constraint the 
line marks the boundary of the feasible region and the infeasible side is conventionally
shaded. This is shown in Figure 7.1.1. Case (i) has constraints $x_2 = x_1^2$,
$x \ge 0$, which can be written $c_1(x) = x_2 - x_1^2$, $c_2(x) = x_1$, $c_3(x) = x_2$, and
$E = \{1\}$, $I = \{2, 3\}$. Case (ii) has constraints $x_2 \ge x_1^2$, $x_1^2 + x_2^2 \le 1$ which can
be written as $c_1(x) = x_2 - x_1^2$, $c_2(x) = 1 - x_1^2 - x_2^2$, $I = \{1, 2\}$, and $E$ empty.
Formulation (7.1.1) covers most types of problem; however the condition that some
variables $x_i$ take only discrete values is not included. This type of condition is covered in
integer programming which is largely beyond the scope of this book. However a 
useful general purpose algorithm is the branch and bound method which enables 
the problem to be reduced to a sequence of smooth problems and hence solved by 
other techniques given in this book. This is described in Section 13.1. Another type 
of condition which is not included in (7.1.1) is a constraint of the form $c_i(x) > 0$;
something more is said about this case in Section 7.2. In fact there is often some 
choice in how best to pose the problem in the first instance and a number of 
possibilities of this type are discussed in Section 7.2. 
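
As an illustration of the definitions above, the following is a minimal sketch of a feasibility test for case (ii) of Figure 7.1.1. The function names and the tolerance `tol` are assumptions introduced here for illustration and do not come from the text.

```python
# Case (ii) of Figure 7.1.1:
#   c1(x) = x2 - x1^2 >= 0        (parabola)
#   c2(x) = 1 - x1^2 - x2^2 >= 0  (unit circle)

def c1(x):
    return x[1] - x[0] ** 2

def c2(x):
    return 1.0 - x[0] ** 2 - x[1] ** 2

EQUALITIES = []          # index set E (empty in case (ii))
INEQUALITIES = [c1, c2]  # index set I = {1, 2}

def is_feasible(x, tol=1e-12):
    """Return True if x satisfies every constraint of (7.1.1) to within tol."""
    eq_ok = all(abs(c(x)) <= tol for c in EQUALITIES)
    ineq_ok = all(c(x) >= -tol for c in INEQUALITIES)
    return eq_ok and ineq_ok

print(is_feasible((0.0, 0.5)))  # above the parabola and inside the circle
print(is_feasible((1.0, 0.5)))  # below the parabola, so c1 is violated
```

A tolerance is used because in floating point arithmetic a point on a constraint boundary rarely evaluates to exactly zero.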

[Figure 7.1.1 Examples of feasible regions: Case (i) and Case (ii)]

It is assumed in (7.1.1) that the functions $c_i(x)$ are continuous which implies
that $R$ is closed. It is also assumed that $f(x)$ is continuous for all $x \in R$ and
preferably for all $x \in \mathbb{R}^n$. If in addition the feasible region is bounded
($\exists\, a > 0$ such that $\|x\| \le a\ \forall\, x \in R$), then it follows that a solution $x^*$
exists. If not then the problem may be unbounded ($f(x) \to -\infty$) or may not have a
minimizing point. The problem also has no solution when $R$ is empty, that is when the
constraints are inconsistent. In fact most practical methods require the stronger
assumption that the objective and constraint functions are also smooth in that their
first and often second continuous derivatives exist ($f, c_i \in C^1$ or $C^2$). The notation
$\nabla f$ ($= g$) and $\nabla^2 f$ ($= G$) for the gradient vector and Hessian matrix of $f$ is
described in Volume 1. The notation $\nabla c_i$ and $\nabla^2 c_i$ is used to denote the
corresponding first and second derivatives of any constraint function $c_i$. The vector
$\nabla c_i$ is also denoted by $a_i$ and is referred to as the normal vector of the
constraint $c_i$. Note that $a_i$ refers to the $i$th vector in a set and not to the $i$th
component of $a$. These vectors are sometimes collected into columns of the Jacobian
matrix $A$ (although this rule is contradicted in the simplex method (Chapter 8) in which
the normal vectors are the rows of $A$). The vector $a_i'$ (that is $a_i(x)$ evaluated at
$x = x'$) is the direction of greatest increase of $c_i(x)$ at $x'$, and if $c_i' = 0$ and
$i \in I$ then the direction is on the feasible side of the constraint (see
Figure 7.1.2) and is at right angles to the zero contour. Most of Volume 2 assumes
the existence of these derivatives which can be used, for example, to characterize 
optimality conditions, as described in Chapter 9, which generalize the results for 
unconstrained minimization given in Volume 1, Section 2.1. This is not to say 
necessarily that user supplied formulae for these derivatives are required in any 
method. Mostly, however, formulae for first derivatives are required, and in some 
cases formulae for second derivatives also. Methods which require no derivative 
information have not been studied to any great extent and the obvious advice is to 
estimate these derivatives by finite differences (see Volume 1), although the result- 
ing algorithm is likely to be less robust and effective when this is done. A different 
situation arises when the functions $f$ and $c_i$ do not have continuous derivatives,
which is referred to as non-differentiable or non-smooth optimization. In this case 
methods for smooth problems are not appropriate and special attention must be 
given to the surfaces of non-differentiability. These behave somewhat like the 
boundary of a constraint in (7.1.1) and it is therefore appropriate to discuss the 
problem within the structure of this volume. This is done for unconstrained non- 
differentiable optimization in Chapter 14: in fact it is also possible to generalize 
these ideas to include both smooth and non-smooth constraint functions, but this 
is beyond the scope of the book, although some references are given. 
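
The advice above about estimating derivatives by finite differences can be sketched as follows. This is a simple forward-difference estimate, not necessarily the scheme described in Volume 1; the helper name `fd_gradient` and the step `h` are assumptions made for illustration. It is applied to the circle constraint $c_2(x) = 1 - x_1^2 - x_2^2$ to estimate its normal vector.

```python
def fd_gradient(func, x, h=1e-6):
    """Forward-difference estimate of the gradient of func at the point x."""
    f0 = func(x)
    grad = []
    for j in range(len(x)):
        shifted = list(x)
        shifted[j] += h                      # perturb one coordinate at a time
        grad.append((func(shifted) - f0) / h)
    return grad

def c2(x):
    return 1.0 - x[0] ** 2 - x[1] ** 2

# (0.6, 0.8) lies on the unit circle, where the exact normal vector is (-1.2, -1.6).
a2 = fd_gradient(c2, [0.6, 0.8])
print(a2)
```

The estimated normal vector points towards the interior of the circle, that is to the feasible side of the constraint, in agreement with the description of the normal vector given earlier.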

[Figure 7.1.2 The normal vector]

Another important concept is that of an active or binding constraint. Active
constraints at any point $x'$ are defined by the index set

$$\mathcal{A}' = \mathcal{A}(x') = \{\, i : c_i(x') = 0 \,\} \tag{7.1.2}$$

so that any constraint is active at $x'$ if $x'$ is on the boundary of its feasible region. If
$x'$ is feasible then $\mathcal{A}' \supseteq E$ clearly follows. In particular the set $\mathcal{A}^*$ of active
constraints at the solution of (7.1.1) is of some importance. If this set is known then
the remaining constraints can be ignored (locally) and the problem can be treated
as an equality constraint problem with $E = \mathcal{A}^*$. Also constraints with $i \notin \mathcal{A}^*$ can
be perturbed by small amounts without affecting the local solution whereas this is
not usually true for an active constraint. An example is given by the problem:
minimize $f(x) = -x_1 - x_2$ subject to $x_2 \ge x_1^2$ and $x_1^2 + x_2^2 \le 1$. Clearly from
Figure 7.1.3 the solution is achieved at $x^* = (1/\sqrt{2}, 1/\sqrt{2})^T$ when the contour of
$f(x)$ is a tangent to the unit circle. Thus the active set in the notation of Figure
7.1.1(ii) is $\mathcal{A}^* = \{2\}$, and the circle constraint $c_2(x)$ is active. Likewise the parabola
constraint $c_1(x)$ is inactive and can be perturbed or removed from the problem
without changing $x^*$. A further refinement of this definition to include strongly
active and weakly active constraints is given in Figure 9.1.2.
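
The example above can be checked numerically. The sketch below evaluates the active set at $x^* = (1/\sqrt{2}, 1/\sqrt{2})^T$ and also recovers the multiplier $\lambda$ for which $g^* = \lambda a_2^*$. The tolerance `tol` and the helper names are assumptions introduced for illustration.

```python
import math

def c1(x):
    return x[1] - x[0] ** 2              # parabola constraint

def c2(x):
    return 1.0 - x[0] ** 2 - x[1] ** 2   # circle constraint

def active_set(x, constraints, tol=1e-9):
    """Indices i with c_i(x) = 0 to within tol, as in definition (7.1.2)."""
    return {i for i, c in constraints.items() if abs(c(x)) <= tol}

constraints = {1: c1, 2: c2}
x_star = (1.0 / math.sqrt(2.0), 1.0 / math.sqrt(2.0))
active = active_set(x_star, constraints)
print(active)

# Gradient of f(x) = -x1 - x2 and normal vector of c2 at x*.
g = (-1.0, -1.0)
a2 = (-2.0 * x_star[0], -2.0 * x_star[1])
lam = g[0] / a2[0]                       # g = lam * a2 holds componentwise here
print(lam)
```

The multiplier is positive, which is consistent with the circle constraint being active at a minimizer.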

[Figure 7.1.3 Active and inactive constraints, showing the contours of $f$]

Methods for the solution of (7.1.1) are usually iterative so that a sequence
$x^{(1)}, x^{(2)}, x^{(3)}, \ldots$ say, is generated from a given point $x^{(1)}$, hopefully converging
to $x^*$. If $x^*$ is a member of the sequence then the method is said to terminate.
Some early methods for constrained optimization were developed in an ad hoc way
and are not strongly supported theoretically. Because of this these methods are
often unreliable and expensive for problems of any size and they are not described
here. However a review of what has been attempted is given by Swann (1974). The
subject of constrained optimization splits into two main parts, linear constraint
programming and nonlinear programming which have quite different features.
In linear constraint programming each constraint is a linear function
$c_i(x) = a_i^T x - b_i$. The boundary of the feasible region for any one such constraint is a
hyperplane, and the normal vector $\nabla c_i$ is constant and is again the vector $a_i$. Linear
constraint problems can be handled by a combination of an elimination method
and an active set method (see Section 7.2) and the iterates $x^{(k)}$ are always feasible
points. The simplest cases are when the objective function is either linear or
quadratic (linear programming or quadratic programming, Chapters 8 and 10
respectively) in both of which cases algorithms which terminate can be determined.
The application to a general objective function is given in Chapter 11, and in this
case many of the possibilities for unconstrained optimization in Volume 1 carry
over directly. For example there are analogues of Newton's method, quasi-Newton
methods, the Gauss-Newton method, restricted step methods, and no-derivative
methods which use finite difference approximations. Similar considerations hold in
regard to using line searches and in regard to deciding what type of convergence test
to use to terminate the iteration. Special cases of a linear constraint are the bounds
$x_i \ge l_i$ or $x_i \le u_i$ in which $a_i$ is $\pm e_i$, the $i$th coordinate vector, and it is particularly
simple to handle such constraints. It is important that algorithms should take this
into account.

The most difficult type of constrained minimization problem is nonlinear 
programming (Chapter 12) in which there exist some non-linear constraint func- 
tions in the problem. In this case a completely satisfactory general purpose method 
has yet to be agreed upon and the subject is one of intense research activity. Of 
course if the non-linear constraints can be rearranged so as to be eliminated directly 
then this should be done, but is not usually possible. Indirect elimination by solving 
a system of equations numerically is possible (Sections 7.2 and 12.4) but is not 
usually efficient and other difficulties exist; this idea is closely related to another 
approach known as a feasible direction method. A different approach is to attempt 
to transform the problem to one of unconstrained minimization by using penalty 
functions (Sections 12.1, 12.2, and 14.3). Efficiency depends on exactly how this 
is done, but it seems inevitable that some sort of penalty function must be used to 
get good global convergence properties. In many algorithms the iteration is deter- 
mined by modelling the original problem in a suitable way. In particular a 
linearization of the constraint functions is often used. This is a first order Taylor 
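series approximation about the current iterate $x^{(k)}$:

$$c_i(x^{(k)} + \delta) \approx l_i^{(k)}(\delta) = c_i^{(k)} + a_i^{(k)T}\delta, \qquad i = 1, 2, \ldots, p. \tag{7.1.3}$$

The linearized function $l_i^{(k)}$ is defined in terms of the correction $\delta$ to $x^{(k)}$, the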
superscript k indicating that it is made on iteration k. This approximation enables 
linear constraint subproblems to be solved on each iteration. As in Volume 1, it is 
also possible to make a quadratic model of the objective function. However to take 
constraint curvature correctly into account it is appropriate to modify the quad- 
ratic term in a suitable way. Methods of this type are very important, although to 
obtain good global properties they must be incorporated with some type of penalty 
function (Sections 12.3 and 14.3). A special case of nonlinear programming is 
geometric programming in which the functions $f$ and $c_i$ have a polynomial type
structure. It is possible to reduce this problem to a linear constraint problem which 
is more readily solved (Section 13.2). Often linear constraints and bounds arise in 
nonlinear programming problems: it is usually possible to take advantage of this 
fact to make the algorithm more efficient. 
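
The linearization (7.1.3) can be illustrated on the circle constraint $c_2(x) = 1 - x_1^2 - x_2^2$ used earlier in this section. The following is a minimal sketch; the helper `linearized` and the chosen iterate are assumptions made for illustration. It shows the error of the linear model shrinking quadratically with the size of the correction $\delta$.

```python
def c2(x):
    return 1.0 - x[0] ** 2 - x[1] ** 2

def grad_c2(x):
    return (-2.0 * x[0], -2.0 * x[1])    # normal vector a_2(x)

def linearized(c, grad_c, xk):
    """Return the first order Taylor model l(delta) = c(xk) + a(xk)^T delta."""
    ck = c(xk)
    ak = grad_c(xk)
    def model(delta):
        return ck + sum(a * d for a, d in zip(ak, delta))
    return model

xk = (0.5, 0.5)
l2 = linearized(c2, grad_c2, xk)

errors = {}
for t in (0.1, 0.01):
    delta = (t, t)
    exact = c2((xk[0] + t, xk[1] + t))
    errors[t] = abs(exact - l2(delta))   # equals 2 * t**2 for this constraint
    print(t, exact, l2(delta), errors[t])
```

For this constraint the neglected term is exactly $-\delta^T\delta$, which is the constraint curvature that a linear model ignores.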

These algorithms, in particular those for nonlinear programming, depend on a 
study of optimality conditions for problem (7.1.1), and this theory is set out in 
Chapter 9. In Volume 1, I tried to write parts of the book in simple terms, avoiding 
the use of too much theory. To some extent I have done this here, for example in 
the presentation of linear and quadratic programming. However constrained 
minimization problems are much more complex than unconstrained problems and 
it is important for the user to have some grasp of this theory. This is especially true 
in regard to Lagrange multipliers and first order conditions and I have tried to give a
simple semi-rigorous introduction in Section 9.1 showing how these multipliers 
arise and can be interpreted. A more rigorous presentation then follows. The same 
is true in regard to second order conditions in Section 9.3. Some simple notions of 
convexity and duality for smooth problems appear in Sections 9.4 and 9.5. The 
subject of non-differentiable optimization is arguably the most difficult that I have 
tried to cover in this book. Some presentations of this subject are extremely 
theoretical although I have tried to avoid this as much as possible. However at the 
expense of introducing a little more theory concerning optimality conditions for 
non-differentiable convex functions (Section 14.2), a reasonably elegant and not 
too difficult treatment can be given. A discussion of algorithms (Sections 14.4 and 
14.5) is also given. 


7.2 Elimination and Other Transformations 


There is considerable scope for making transformations to a constrained mini- 
mization problem which reduce it to a form which is more readily solved. This can 
be advantageous and a number of such possibilities are discussed. However it is 
important to be aware from the outset that this procedure is not entirely without 
risk and that solutions of the original and transformed problems may not correspond 
on a one-to-one basis or that methods may not perform adequately on the trans- 
formed problem. A number of examples of this are given in this section and else- 
where in the book, and the user should be on his guard. The most simple possibility 
for equality constraints is to use the equations to eliminate some of the variables in 
the problem (elimination). If there are just $m$ equations $c(x) = 0$ which can be
rearranged directly to give

$$x_1 = \phi(x_2) \tag{7.2.1}$$

where $x_1$ and $x_2$ are partitions of $x$ in $\mathbb{R}^m$ and $\mathbb{R}^{n-m}$, then the original objective
function $f(x_1, x_2)$ is replaced by

$$\psi(x_2) = f(\phi(x_2), x_2) \tag{7.2.2}$$

and $\psi(x_2)$ is minimized over $x_2$ without any constraints. A simple example is given
in Question 7.3. Derivatives of $\psi$ are readily obtained from those of $f$ and $c$ (see
Question 7.5). Some care has to be taken to avoid an ill-conditioned rearrangement 
when forming (7.2.1), for instance with linear constraints it is advisable to use some 
sort of pivoting on the variables. In some cases the method may fail completely, as 
shown in Question 7.4. In fact it is possible to discuss elimination in more general 
terms, implicitly by first making a linear transformation of variables; this is 
described in Section 10.1. In cases where no direct rearrangement like (7.2.1) is
available, it is possible to regard $c(x_1, x_2) = 0$ as a system of non-linear equations
which can be solved by the Newton-Raphson method (Volume 1, Section 6.2).
In doing this $x_2$ remains fixed and a vector $x_1$ is determined which solves the
equations. Thus $x_1$ depends on $x_2$ and so the process implicitly defines a function
$x_1 = \phi(x_2)$. This method is outlined in a more general form in Section 12.4;
however the process is not always the most efficient and there can be difficulties
in getting the Newton-Raphson method to converge. An alternative transformation
for the equality constraint problem is the method of Lagrange multipliers (Section 
9.1) in which the system of non-linear equations (9.1.5) is solved which arises from 
the first order necessary conditions. Except in special cases this system must be 
solved numerically, which renders the method of little practical use. The method 
can also fail, not only when the solution of (9.1.5) corresponds to a constrained 
maximum point or saddle point, but also when the regularity condition (9.2.4) does 
not hold (see Question 9.14). 
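
Direct elimination can be sketched on a small problem. The problem below is chosen here for illustration and is not one of the Questions referred to in the text: minimize $f(x) = x_1^2 + x_2^2$ subject to $c(x) = x_1 + x_2 - 1 = 0$, so that $x_1 = \phi(x_2) = 1 - x_2$. The derivative of the reduced function is formed by the chain rule, and a fixed-step descent iteration stands in for whichever unconstrained method would be used in practice.

```python
def f(x1, x2):
    return x1 ** 2 + x2 ** 2

def phi(x2):
    return 1.0 - x2                  # rearrangement of c(x) = x1 + x2 - 1 = 0

def psi(x2):
    return f(phi(x2), x2)            # reduced objective function

def dpsi(x2):
    # Chain rule: dpsi/dx2 = (df/dx1) * phi'(x2) + df/dx2, with phi'(x2) = -1.
    x1 = phi(x2)
    return 2.0 * x1 * (-1.0) + 2.0 * x2

def descend(dfun, x, step=0.25, iters=20):
    """Fixed-step descent on a one-variable function (illustrative only)."""
    for _ in range(iters):
        x -= step * dfun(x)
    return x

x2_min = descend(dpsi, 0.0)
x1_min = phi(x2_min)
print(x1_min, x2_min, psi(x2_min))
```

Because the eliminated variable is recovered from $\phi$, the computed point satisfies the equality constraint by construction.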

Elimination methods are not directly applicable to inequality constraint problems
unless the set of active constraints $\mathcal{A}^*$ is known. However it is possible to use
a trial and error sort of method in which a guess $\mathcal{A}$ is made at the set of active
constraints, and constraints in $\mathcal{A}$ are then treated as equalities, neglecting the
remaining inequality constraints. The resulting equality constraint problem is then
solved by elimination or by the method of Lagrange multipliers, giving a solution
$\hat{x}$. It is necessary to check that $\hat{x}$ is feasible with respect to the constraints which
have been ignored. If not, one of these is added to the active set and the above
process is repeated. If $\hat{x}$ is feasible then it is also necessary to check that the first
order conditions are satisfied. To do this requires the calculation of the corresponding
Lagrange multiplier vector $\hat{\lambda}$. Since $\lambda_i = \partial f/\partial c_i$ to first order measures the effect
of perturbations in the $c_i$ on $f$, it is necessary for an inequality constraint $c_i(x) \ge 0$
that $\lambda_i \ge 0$ at the solution, for otherwise a feasible perturbation would reduce $f$.
Thus if there are any $\hat{\lambda}_i < 0$, one such constraint must be removed from the active
set and the process repeated again. On the other hand, if $\hat{\lambda}_i \ge 0$ for all inequality
constraints in the active set then the required solution is located. Methods of this
type can be used in an informal way on small problems. However they are most useful
in solving all types of linear constraint problem when systematic procedures can be
devised. Such methods include the simplex method for linear programming and the
active set method for all types of linear constraint programming. So-called exchange
algorithms for best linear $L_1$ and $L_\infty$ data fitting are also examples of this type of
procedure. Systematic procedures using active set methods for non-linear constraints
based on solving the equality constraint problems by implicit elimination can also be
devised (Section 12.4) but there are some difficulties which are not readily overcome.
It is difficult to handle constraints of the form $c_i(x) > 0$ in an active set method
because the feasible region is not closed and the constraint cannot be active at a
solution. However it can be useful to include them in the problem via the
transformation $c_i(x) \ge \epsilon > 0$, possibly solving a sequence of problems in which
$\epsilon \downarrow 0$ if the constraints happen to be active. The reason for doing this might be to
prevent or dissuade $f(x)$ being evaluated at an infeasible point at which it is not
defined (for example the problem: min $x \log_e x$ subject to $x > 0$). It may not be
satisfactory just to ignore the constraints because the problem may then become
unbounded or have a global solution with $c_i(x) \le 0$, which is of no interest.
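
The trial and error procedure described above can be sketched on a problem simple enough for the equality constraint subproblem to be solved in closed form. The problem is an assumption chosen for illustration: minimize $\tfrac{1}{2}\sum_i (x_i - p_i)^2$ subject to the bounds $x_i \ge 0$. With a guessed active set, the subproblem solution has $x_i = 0$ for active indices and $x_i = p_i$ otherwise, and the multiplier of an active bound is $\lambda_i = \partial f/\partial x_i = x_i - p_i$. All function names are hypothetical.

```python
def solve_equality_problem(p, active):
    """Minimize 0.5 * sum((x_i - p_i)^2) with x_i = 0 imposed for i in active."""
    x = [0.0 if i in active else p[i] for i in range(len(p))]
    lam = {i: x[i] - p[i] for i in active}   # lambda_i = df/dx_i at x
    return x, lam

def active_set_trial(p, guess=()):
    """Trial and error on the active set for the bounds x_i >= 0."""
    active = set(guess)
    while True:
        x, lam = solve_equality_problem(p, active)
        # Feasibility check on the constraints that were ignored.
        violated = [i for i in range(len(p)) if i not in active and x[i] < 0.0]
        if violated:
            active.add(violated[0])
            continue
        # First order check: multipliers of active inequalities must be >= 0.
        negative = sorted(i for i in active if lam[i] < 0.0)
        if negative:
            active.remove(negative[0])
            continue
        return x, lam, active

x, lam, active = active_set_trial([1.0, -2.0, 0.5], guess={0})
print(x, lam, active)
```

Starting from the wrong guess that the first bound is active, the sketch first adds the violated second bound and then drops the first bound because its multiplier is negative. This separable problem cannot cycle, but that is a property of the example and not of the procedure in general.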

Some other transformations are worthy of note which relate equality and
inequality constraint problems. For example a constraint $c_i(x) = 0$ can be
equivalently replaced by two opposite inequality constraints $c_i(x) \ge 0$ and
$-c_i(x) \ge 0$. This enables (7.1.1) to be reduced to an inequality constraint problem.
However there are some practical disadvantages due to degeneracy and other reasons
and the idea is best avoided, although it can occasionally be useful. The alternative
possibility is to write $c_i(x) \ge 0$ as the equality constraint $\min(c_i(x), 0) = 0$.
Unfortunately this function is not a $C^1$ function and so is usually excluded on this
count. Another possibility is to replace a constraint $c_i(x) \ge 0$ by adding an extra
variable, $z$ say, giving an equality constraint $c_i(x) = z$ and a bound $z \ge 0$. The
variable $z$ is referred to as a slack variable since it measures the slack in the
inequality constraint. This transformation is most useful in the simplex method for
linear programming which requires all general inequality constraints to be handled in
this way, but is not necessary in active set methods which treat inequalities of any
type directly. Furthermore, following an idea introduced later in this section, it is
possible to do away with the need for the bound $z \ge 0$. This is done by adding a
quadratic slack variable $y$ and replacing $c_i(x) \ge 0$ by the (non-linear) equality
constraint $c_i(x) = y^2$. This removes the need to treat inequality constraints directly.
However this transformation does cause some distortion as explained below and in this
case it may be somewhat dangerous, in particular because of the following feature. Let
for example $c(x) \ge 0$ be the only constraint and let $x'$ be such that $c' = 0$ and
$g' = a'\lambda'$ where $\lambda' < 0$. Then $x'$ and $\lambda'$ do not satisfy first order conditions for
a solution. Yet the vector $x'$ augmented by $y' = 0$ does satisfy first order conditions
in the transformed equality constraint problem with the same $\lambda'$. Thus the
transformation does not seem to be able to distinguish whether or not constraints are
active on the basis of first order information. I have also heard bad reports of
quadratic slacks in practice which might well be accounted for in this way.
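
The difficulty with quadratic slacks can be seen on a one-variable example, chosen here for illustration: minimize $f(x) = (x - 1)^2$ subject to $c(x) = x \ge 0$. At $x' = 0$ the constraint is active with $g' = -2$ and $a' = 1$, so $\lambda' = -2 < 0$ and $x'$ is not a solution. The sketch below checks that $(x', y') = (0, 0)$ with the same $\lambda'$ is nevertheless a first order point of the transformed problem with the constraint $x - y^2 = 0$. The Lagrangian convention $L = f - \lambda(c - y^2)$ is an assumption of this sketch.

```python
def f(x):
    return (x - 1.0) ** 2

def lagrangian_gradient(x, y, lam):
    """Gradient of L(x, y) = f(x) - lam * (x - y^2) with respect to (x, y)."""
    dL_dx = 2.0 * (x - 1.0) - lam    # g - lam * a, with a = dc/dx = 1
    dL_dy = 2.0 * lam * y            # derivative of the slack term lam * y^2
    return dL_dx, dL_dy

x, y, lam = 0.0, 0.0, -2.0
grad = lagrangian_gradient(x, y, lam)
residual = x - y ** 2                # equality constraint of the transformed problem
print(grad, residual)

# The point is not a minimizer of the original problem: a feasible step reduces f.
print(f(0.1) < f(0.0))
```

Both components of the gradient vanish and the equality constraint holds, so first order information alone cannot reject this point.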

Many other useful transformations arise in constrained optimization and are 
used in subsequent chapters. Perhaps the most well-known idea is the use of penalty 
functions for nonlinear programming. The idea is to transform the problem to one 
of unconstrained optimization by adding to the objective function a penalty term 
which weights constraint violations. In sequential penalty functions x* is found as 
the limit of the minimizing points of a sequence of penalty functions, as some con- 
trolling parameter is changed. More recently the value has been realized of an exact 
penalty function which has x* as its local minimizer. These transformations are 
described in some detail in Sections 12.1, 12.2, 12.5, and 14.3. Other transform- 
ations of some importance are those arising in duality (Section 9.5), integer 
programming (Section 13.1), and geometric programming (Section 13.2), amongst 
others. 

It is also possible to make transformations of variables in an attempt to simplify
the problem. For example the bound $x_i \ge 0$ can be removed by defining a new
variable $y_i$ which replaces $x_i$, such that $x_i = y_i^2$. Then for any $y_i$ in $(-\infty, \infty)$ it
follows that $x_i \ge 0$ so the bound does not need to be explicitly enforced. Another
similar transformation for $l_i \le x_i \le u_i$ is to let $y_i$ satisfy
$x_i = l_i + (u_i - l_i)\sin^2 y_i$. For strict constraints $x_i > 0$ it is possible to use
$x_i = e^{y_i}$. The advantage of these transformations is that they do extend the range of
problems which can be handled by an unconstrained minimization routine. This is not to
say that minimization with simple bounds $l_i \le x_i \le u_i$ is at all difficult; in fact
the opposite is true and it is probably more efficient to treat the problem directly.
It is simply that subroutines which minimize functions subject only to bounds are much
less readily available to the user at present. These ideas can also be used to
transform inequality constraints to equalities (see above in regard to quadratic
slacks), although this possibility should be viewed with some suspicion for the reason
given above. These transformations do cause some distortion which often may not be
favourable. For example the problem min $x^2$ subject to $x \ge 0$, after transforming
$x = y^2$, becomes min $y^4$. This has a singular Hessian at the solution which causes any
standard minimization method based on a quadratic model to converge slowly. Another
example is the convex programming problem min $(x - 1)^2$ subject to $x \ge 0$.
Although the transformation is well behaved at the solution $x^* = 1$, it induces a
stationary point with a non-positive-definite Hessian matrix at $x = 0$ and both these
features could possibly cause difficulties (see also Question 7.6). Thus although
such transformations can be useful, the user should be aware that they are not
entirely risk free.
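
The transformations of variables described above, and the distortion they can introduce, can be sketched as follows. The function names and the finite-difference step are assumptions made for illustration. The curvature check uses the example min $(x - 1)^2$ subject to $x \ge 0$ under the substitution $x = y^2$.

```python
import math

def to_nonneg(y):
    return y * y                                 # x = y^2 enforces x >= 0

def to_interval(y, lo, hi):
    return lo + (hi - lo) * math.sin(y) ** 2     # enforces lo <= x <= hi

def to_positive(y):
    return math.exp(y)                           # x = e^y enforces x > 0

def psi(y):
    return (to_nonneg(y) - 1.0) ** 2             # f(x) = (x - 1)^2 with x = y^2

def second_difference(fun, y, h=1e-3):
    """Central difference estimate of the second derivative of fun at y."""
    return (fun(y + h) - 2.0 * fun(y) + fun(y - h)) / (h * h)

curv_at_zero = second_difference(psi, 0.0)       # approximately -4
curv_at_one = second_difference(psi, 1.0)        # approximately 8
print(curv_at_zero, curv_at_one)
```

The negative curvature at $y = 0$ corresponds to the induced stationary point at $x = 0$, whereas the curvature at $y = 1$, which maps to the solution $x^* = 1$, is positive.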

Another transformation which enables $|x_i|$ functions to be handled is to
replace the variable $x_i$ by two non-negative variables $x_i^+$ and $x_i^-$ representing the
positive and negative parts of $x_i$ (that is $\max(x_i, 0)$ and $\max(-x_i, 0)$). The conditions
$x_i^+ \ge 0$ and $x_i^- \ge 0$ are explicitly included in the problem; also whenever
$x_i$ appears in the problem it is replaced by $x_i^+ - x_i^-$ and similarly $|x_i|$ is
replaced by $x_i^+ + x_i^-$ (see Question 8.12). This latter replacement is only valid
if one of $x_i^+$ or $x_i^-$ is zero, which can sometimes be guaranteed, for example in
otherwise linear problems when both $x_i^+$ and $x_i^-$ together cannot be basic
(Chapter 8). This transformation can also be used to handle unbounded variables
in a linear programming problem. An alternative technique for handling $|x_i|$ terms
is described in Section 8.4. These ideas can be extended to functions $|c_i(x)|$ by
adding extra variables $y_i$ and the equality constraint $c_i(x) = y_i$ (see Question 8.11),
thus enabling $L_1$ approximation problems to be handled by smooth techniques.
Similar ideas for minimizing max functions or $L_\infty$ functions can be tackled by
introducing an extra variable v as described in Section 14.1. However all these 
techniques are really attempting to solve non-differentiable optimization problems 
as smooth problems. In the current state of the art this can be useful, but when 
software becomes readily available for some of the better more direct methods 
described in Sections 14.4 and 14.5, these should be preferred. 
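
The condition that one of the two parts must vanish can be seen in a small numerical sketch. The helper names `split_parts`, `recombine` and `abs_via_parts` are hypothetical and chosen only for illustration.

```python
def split_parts(x):
    """Return (x_plus, x_minus) = (max(x, 0), max(-x, 0))."""
    return max(x, 0.0), max(-x, 0.0)

def recombine(x_plus, x_minus):
    """x is always recovered as x_plus - x_minus."""
    return x_plus - x_minus

def abs_via_parts(x_plus, x_minus):
    """x_plus + x_minus; equals |x| only when one of the parts is zero."""
    return x_plus + x_minus

xp, xm = split_parts(-2.5)          # one part is zero by construction
ok_value = abs_via_parts(xp, xm)    # agrees with |-2.5|

# Both parts positive: the pair (3, 1) still represents x = 2,
# but the sum overstates |x|.
bad_value = abs_via_parts(3.0, 1.0)
```

In an otherwise linear problem the simplex method never holds both parts in the basis together, which is what rules out the second situation.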

Finally the very important transformation of scaling either the constraints or the 
variables in the problem is discussed. Scaling of a constraint set is achieved by multi- 
plying each constraint function by a constant chosen so that the value of each con- 
straint function, evaluated for typical values of x, is of the same order of magni- 
tude. This can be important in that this scales the Lagrange multipliers (inversely) 
and so can make more reliable the test on the magnitude of a multiplier which is 
used in some algorithms. A well-scaled matrix is also important in some linear 
algebra routines when pivoting tests are made. Moreover when using penalty functions
which involve quantities like $c^T c$ or $\|c\|_1$, it is important that constraints are
scaled. In a similar way scaling of the variables can sometimes be important. This
again arises when pivoting tests on the variables are made, or when implicitly using
some norm of the variables, for example in restricted step methods or methods with
a bias towards steepest descent (see Volume 1). In practice variables are usually
scaled by multiplying each one by a suitable constant. However a non-linear scaling
which can be useful for variables $x_i > 0$ is to use the transformation $x_i = e^{y_i}$. Then
variables of magnitudes $10^{-6}$, $10^{-3}$, $10^{0}$, $10^{3}$, ..., say, which typically can occur
in kinetics problems, are transformed into logarithmic variables with magnitudes
which are well scaled.
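
Both kinds of scaling can be sketched in a few lines. The constraint values and variable magnitudes below are hypothetical, and the helper names `scale_constraints` and `log_variables` are assumptions made for this illustration only.

```python
import math

def scale_constraints(values_at_typical_x):
    """Factors s_i = 1/|c_i| so that each scaled constraint has magnitude 1."""
    return [1.0 / abs(v) for v in values_at_typical_x]

def log_variables(xs):
    """Non-linear scaling x_i = exp(y_i), that is y_i = ln x_i, for x_i > 0."""
    return [math.log(x) for x in xs]

typical_c = [2.0e4, -5.0e-3, 0.8]    # hypothetical constraint values
factors = scale_constraints(typical_c)
scaled_c = [s * v for s, v in zip(factors, typical_c)]

x = [1e-6, 1e-3, 1.0, 1e3]           # magnitudes spanning nine decades
y = log_variables(x)                 # all within about 14 of zero
```

Multiplying a constraint by $s_i$ divides its Lagrange multiplier by $s_i$, which is the inverse scaling of the multipliers mentioned above.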


Questions for Chapter 7 


1. Calculate the Jacobian matrix $\nabla l^T$ of the linear system $l(x) = A^T x - b$. If $l$ is
obtained by linearizing a non-linear system as in (7.1.3) show that both systems
have the same Jacobian matrix.

2. Obtain the gradient vector and the Hessian matrix of the functions $f(x) + h(c(x))$
and $f(x) + \lambda^T c(x)$. In the latter case treat the cases both where $\lambda$ is a constant
vector and where it is a function $\lambda(x)$.

3. Find the solution of the problem minimize $-x - y$ subject to $x^2 + y^2 = 1$ by
graphical means and also by eliminating x, and show that the same solution is 
obtained. Discuss what happens, however, if the square root which is required 
is chosen to have a negative sign. 

4. Solve the problem minimize $x^2 + y^2$ subject to $(x - 1)^3 = y^2$ both graphically
and also by eliminating y. In the latter case show that the resulting function of 
x has no minimizer and explain this apparent contradiction. What happens if the 
problem is solved by eliminating x? 

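5. Consider finding derivatives of the functions $\phi(x_2)$ and $\psi(x_2)$ defined in (7.2.1)
and (7.2.2). Define partitions

$$x = \begin{pmatrix} x_1 \\ x_2 \end{pmatrix}, \qquad g = \begin{pmatrix} g_1 \\ g_2 \end{pmatrix}, \qquad A = \begin{bmatrix} A_1 \\ A_2 \end{bmatrix}$$

and show by using the chain rule that $\nabla_2 \phi^T = -A_2 A_1^{-1}$ and hence that
$\nabla_2 \psi = g_2 - A_2 A_1^{-1} g_1$. Second derivatives of $\psi$ are most conveniently obtained
as in (12.4.6) and (12.4.7) by setting $V^T = [0 : I]$ and hence $Z^T = [-A_2 A_1^{-1} : I]$
and $S^T = [A_1^{-1} : 0]$.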

6. Consider the problem minimize $f(x_1, x_2)$ subject to $x_1 \ge 0$, $x_2 \ge 0$ when the
transformation $x_i = y_i^2$ is used. Show that $x' = 0$ is a stationary point of the transformed
function, but is not minimal if any $g_i' < 0$. If $g' = 0$ then second order
information in the original problem would usually enable the question of 
whether x’ is a minimizer to be determined. Show that in the transformed 
problem this cannot be done on the basis of second order information. 


Chapter 8 


Linear programming 


8.1 Structure 


The simplest type of constrained optimization problem is obtained when the
functions $f(x)$ and $c_i(x)$ in (7.1.1) are all linear functions of $x$. The resulting
problem is known as a linear programming (LP) problem. Such problems have been
studied since the earliest days of electronic computers, and the subject is often 
expressed in a quasi-economic terminology which to some extent obscures the basic 
numerical processes which are involved. This presentation aims to make these pro- 
cesses clear, whilst retaining some of the traditional nomenclature which is widely 
used. One main feature of the traditional approach is that linear programming is 
expressed in the standard form 


$$\begin{array}{ll} \underset{x}{\text{minimize}} & f(x) \triangleq c^T x \\ \text{subject to} & Ax = b, \quad x \ge 0, \end{array} \qquad (8.1.1)$$

where $A$ is an $m \times n$ matrix, and $m \le n$ (usually $<$). The symbol $\triangleq$ means 'defined
by'. Thus the allowable constraints on the variables are either linear equations or
non-negativity bounds. The coefficients $c$ in the linear objective function are often
referred to as costs. An example with four variables ($n = 4$) and two equations
($m = 2$) is


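$$\begin{array}{ll} \text{minimize} & x_1 + 2x_2 + 3x_3 + 4x_4 \\ \text{subject to} & x_1 + x_2 + x_3 + x_4 = 1 \\ & x_1 + x_3 - 3x_4 = \tfrac{1}{2} \\ & x_1 \ge 0,\ x_2 \ge 0,\ x_3 \ge 0,\ x_4 \ge 0. \end{array} \qquad (8.1.2)$$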


More general LP problems can be reduced to standard form without undue difficulty,
albeit with some possible loss of efficiency. For instance a general linear
inequality $a^T x \le b$ can be transformed using a slack variable $z$ (see Section 7.2) to
the equation $a^T x + z = b$ and the bound $z \ge 0$. Alternatively the dual transformation
can sometimes be used advantageously to obtain a standard form and this is
described in more detail in Section 9.5. More general bounds $x_i \ge l_i$ can be dealt
with by a shift of origin, and if no bound exists at all on $x_i$ in the original problem,
then the standard form can be reached by introducing non-negative variables $x_i^+$
and $x_i^-$, as described in Section 7.2. In fact very little is lost in complexity if the
bounds in (8.1.1) are expressed as

$$l \le x \le u, \qquad (8.1.3)$$

as Question 8.8 illustrates. The merit of using other possible standard forms is discussed
in more detail in Section 8.3. However for the most part this text will
concentrate on the solution of problems which are already in the standard form
(8.1.1).
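
The reduction to standard form can be sketched for the case of inequalities $Ax \le b$ in free variables, combining a slack variable per inequality with the split of each free variable into two non-negative parts. The function names `to_standard_form` and `lift` and the small data set are assumptions made for illustration; they are not taken from the text.

```python
def to_standard_form(c, A, b):
    """Convert  min c^T x  s.t.  A x <= b, x free  into
    min c_s^T v  s.t.  A_s v = b, v >= 0, with v = (x_plus, x_minus, z)."""
    m = len(A)
    c_s = list(c) + [-ci for ci in c] + [0] * m
    A_s = []
    for i, row in enumerate(A):
        slack = [1 if j == i else 0 for j in range(m)]
        A_s.append(list(row) + [-a for a in row] + slack)
    return c_s, A_s, list(b)

def lift(x, A, b):
    """Map a point x satisfying A x <= b to the standard-form variables."""
    x_plus = [max(xi, 0) for xi in x]
    x_minus = [max(-xi, 0) for xi in x]
    z = [bi - sum(a * xi for a, xi in zip(row, x)) for row, bi in zip(A, b)]
    return x_plus + x_minus + z

# Hypothetical data: two inequalities in two free variables.
c, A, b = [1, -2], [[1, 1], [-1, 3]], [4, 6]
c_s, A_s, b_s = to_standard_form(c, A, b)
v = lift([-1, 1], A, b)   # x = (-1, 1) satisfies both inequalities
```

The standard-form problem here has six variables instead of two, which illustrates the possible loss of efficiency mentioned above.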

It is important to realize that a problem in standard form may have no solution, 
either because there is no feasible point (the problem is infeasible), or because 
$f(x) \to -\infty$ for $x$ in the feasible region (the problem is unbounded). However it is
shown that there is no difficulty in detecting these situations, and so the text 
concentrates on the usual case in which a solution exists (possibly not unique). 

It is also convenient to assume that the equations are independent, so that no
non-trivial linear combination of them vanishes. In theory this situation can always be achieved,
either by removing dependent equations or by adding artificial variables (see Section 
8.4 and Question 8.21), although in practice there may be numerical difficulties if 
this dependence is not recognized. 

If (8.1.1) is considered in more detail, it can be seen that if $m = n$, then the
equations $Ax = b$ determine a unique solution, and the objective function $c^T x$ and
the bounds $x \ge 0$ play no part. In most cases however $m < n$, so that the system
$Ax = b$ is underdetermined and $n - m$ degrees of freedom remain. In particular the
system can determine only $m$ variables, given values for the remaining $n - m$
variables. For example the equations $Ax = b$ in (8.1.2) can be rearranged as


$$\begin{aligned} x_1 &= \tfrac{1}{2} - x_3 + 3x_4 \\ x_2 &= \tfrac{1}{2} - 4x_4 \end{aligned} \qquad (8.1.4)$$

which determines $x_1$ and $x_2$ given values for $x_3$ and $x_4$, or alternatively as

$$\begin{aligned} x_1 &= \tfrac{7}{8} - \tfrac{3}{4}x_2 - x_3 \\ x_4 &= \tfrac{1}{8} - \tfrac{1}{4}x_2 \end{aligned} \qquad (8.1.5)$$

which determines $x_1$ and $x_4$ from $x_2$ and $x_3$, and in other ways as well. It is important
to consider what values these remaining $n - m$ variables can take in the standard
form problem. The objective function $c^T x$ is linear and so contains no curvature
which can give rise to a minimizing point. Hence such a point must be created by the
conditions $x_i \ge 0$ becoming active on the boundary of the feasible region. For
example if (8.1.5) is used to eliminate the variables $x_1$ and $x_4$ from the problem
(8.1.2), then the objective function can be expressed as


$$f = x_1 + 2x_2 + 3x_3 + 4x_4 = \tfrac{11}{8} + \tfrac{1}{4}x_2 + 2x_3. \qquad (8.1.6)$$


Clearly this function has no minimum value unless the conditions $x_2 \ge 0$, $x_3 \ge 0$
are imposed, in which case the minimum occurs when $x_2 = x_3 = 0$. An illustration is
given in Figure 8.1.1 for the more simple conditions $x_1 + 2x_2 = 1$ and $x_1 \ge 0$,
$x_2 \ge 0$. The feasible region is the line joining the points $a = (0, \tfrac{1}{2})^T$ and $b = (1, 0)^T$.
When the objective function $f(x)$ is linear the solution must occur at either $a$ or $b$
with either $x_1 = 0$ or $x_2 = 0$ (try different linear functions, for example $f = x_1 + x_2$
or $f = x_1 + 3x_2$). If however $f(x) = x_1 + 2x_2$ then any point on the line segment is
a solution and this includes both $a$ and $b$. This corresponds to the existence of a
non-unique solution.

Figure 8.1.1 Constraints for a simple LP problem (the line $x_1 + 2x_2 = 1$)

To summarize therefore, a solution of an LP problem in standard form always
exists at one particular extreme point or vertex of the feasible region, with at least
n — m variables having zero value, and the remaining m variables being uniquely 
determined by the equations Ax = b and taking non-negative values. This result is 
fundamental to the development of LP methods, and can be established rigorously 
using the notions of convexity (Section 9.4). The proof is sketched out in some 
detail in Questions 9.20 to 9.22. 
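
The vertex property can be checked by brute force on the example (8.1.2), taking its data to be the two equations $x_1 + x_2 + x_3 + x_4 = 1$ and $x_1 + x_3 - 3x_4 = \tfrac{1}{2}$ with costs $(1, 2, 3, 4)$. The sketch below simply enumerates every choice of $m$ basic variables; this is not the simplex method and is only practicable for tiny problems. The helper `solve` is an illustrative exact Gauss-Jordan routine written for this example.

```python
from fractions import Fraction as F
from itertools import combinations

def solve(M, rhs):
    """Solve M x = rhs exactly by Gauss-Jordan elimination; None if singular."""
    n = len(M)
    T = [[F(v) for v in row] + [F(r)] for row, r in zip(M, rhs)]
    for col in range(n):
        piv = next((r for r in range(col, n) if T[r][col] != 0), None)
        if piv is None:
            return None
        T[col], T[piv] = T[piv], T[col]
        T[col] = [v / T[col][col] for v in T[col]]
        for r in range(n):
            if r != col:
                factor = T[r][col]
                T[r] = [vr - factor * vc for vr, vc in zip(T[r], T[col])]
    return [row[n] for row in T]

A = [[1, 1, 1, 1], [1, 0, 1, -3]]
b = [1, F(1, 2)]
c = [1, 2, 3, 4]
m, n = 2, 4

best = None
for basis in combinations(range(n), m):
    A_B = [[A[i][j] for j in basis] for i in range(m)]
    x_B = solve(A_B, b)
    if x_B is None or min(x_B) < 0:
        continue                      # singular basis or infeasible point
    f = sum(c[j] * v for j, v in zip(basis, x_B))
    if best is None or f < best[0]:
        best = (f, [j + 1 for j in basis], x_B)
```

With this data the best vertex has $x_1$ and $x_4$ basic, with values $\tfrac{7}{8}$ and $\tfrac{1}{8}$ and objective value $\tfrac{11}{8}$, in agreement with (8.1.5) and (8.1.6).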

The main difficulty in linear programming is to find which n — m variables take 
zero value at the solution. The earliest method for solving this problem is the 
simplex method, which tries different sets of possibilities in a systematic way. This 
method is described in Section 8.2 and is still predominant today, albeit often in 
more sophisticated forms. Different variations of the method exist, depending upon 
exactly which intermediate quantities are computed. The earliest tableau form 
became superseded by the more efficient revised simplex method, both of which 
are described in Section 8.2. More recently methods based on using matrix fac- 
torizations have been suggested in order to control round-off errors more effec- 
tively. For large sparse LP problems, product form methods have enabled problems 
of up to $10^6$ variables to be solved in practice. Both these developments are
described in Section 8.5. An apparently different approach to LP is the active set 
method described in Section 8.3 which, however, turns out to be equivalent to the 
simplex method with slack variables, although different intermediate matrices are 
stored. The problem of calculating initial feasible points for LP and other linear 
constraint problems is described in Section 8.4. All these methods have one possible 
situation in which they can fail to solve a problem which has a well-defined solution. 
This is referred to as degeneracy and is described in Section 8.6. 


8.2 The Simplex Method 


The simplex method for solving an LP problem in standard form generates a
sequence of feasible points $x^{(1)}, x^{(2)}, \ldots$ which terminates at a solution. Since
there exists an extreme point at which the solution occurs, each iterate $x^{(k)}$ is an
extreme point. Thus $n - m$ of the variables have zero value at $x^{(k)}$ and are referred
to as nonbasic variables (the index set $N^{(k)}$) and the remaining $m$ variables have a
non-negative value (usually positive) and are referred to as basic variables (the index
set $B^{(k)}$). The simplex method makes systematic changes to these sets after each
iteration, in order to find the choice which gives the optimal solution. The superscript
$(k)$ is often omitted for clarity. At each iteration it is convenient to assume
that the variables are permuted so that the basic variables are the first $m$ elements
of $x$. Then $x^T = (x_B^T, x_N^T)$ can be written, where $x_B$ and $x_N$ refer collectively to the
basic and nonbasic variables respectively. The matrix $A$ in (8.1.1) can also be
partitioned similarly into $A = [A_B : A_N]$ where the basis matrix $A_B$ is $m \times m$ and
$A_N$ is $m \times (n - m)$. The equations $Ax = b$ can thus be written

$$[A_B : A_N] \begin{pmatrix} x_B \\ x_N \end{pmatrix} = A_B x_B + A_N x_N = b. \qquad (8.2.1)$$

At an extreme point it is always possible to find a partitioning into $B$ and $N$ such
that $A_B$ is non-singular (assuming that $A$ has full rank; see Question 9.20). Also
since $x_N^{(k)} = 0$ it is possible to write

$$x^{(k)} = \begin{pmatrix} x_B \\ x_N \end{pmatrix}^{(k)} = \begin{pmatrix} \hat{b} \\ 0 \end{pmatrix} \qquad (8.2.2)$$

where $\hat{b} = A_B^{-1} b$. Since the basic variables must take non-negative values it is required
that $\hat{b} \ge 0$. The partitioning $B^{(k)}$ and $N^{(k)}$ and the extreme point $x^{(k)}$ with
the above properties ($x_B^{(k)} = \hat{b} \ge 0$, $x_N^{(k)} = 0$, $A_B$ non-singular) is referred to as a
basic feasible solution (b.f.s.). An example is provided by the LP problem (8.1.2)
with the choice $B = \{1, 2\}$ and $N = \{3, 4\}$ (that is the variables $x_1$ and $x_2$ are basic,
and $x_3$ and $x_4$ are nonbasic). Then


$$A_B = \begin{bmatrix} 1 & 1 \\ 1 & 0 \end{bmatrix}, \qquad A_N = \begin{bmatrix} 1 & 1 \\ 1 & -3 \end{bmatrix}$$

and

$$\hat{b} = A_B^{-1} b = \begin{bmatrix} 0 & 1 \\ 1 & -1 \end{bmatrix} \begin{pmatrix} 1 \\ \tfrac{1}{2} \end{pmatrix} = \begin{pmatrix} \tfrac{1}{2} \\ \tfrac{1}{2} \end{pmatrix} \ge 0.$$

Since $A_B$ is non-singular and $\hat{b} \ge 0$, this choice of $B$ and $N$ determines a basic
feasible solution. It is also of interest to know the value ($\hat{f}$ say) of the objective
function at a basic feasible solution. Partitioning $c^T = (c_B^T, c_N^T)$ and using (8.2.2)
it follows that

$$\hat{f} = c^T x^{(k)} = c_B^T \hat{b}.$$

In the example above, $c_B = (1, 2)^T$ so that the objective function has the value
$\hat{f} = 1\tfrac{1}{2}$. The nomenclature 'basic feasible solution' can be somewhat confusing
(basic feasible point would be better) since 'solution' refers to a solution of the
constraint equations $Ax = b$ and $x \ge 0$, and not to the overall solution of (8.1.1),
which is therefore referred to as an optimal b.f.s. It should be noticed that not
every possible choice of $B$ and $N$ gives rise to a b.f.s. For instance in (8.1.2),
$B = \{1, 3\}$ gives a singular matrix $A_B$, and $B = \{2, 4\}$ gives a vector $\hat{b}$ for which
$\hat{b}_4 < 0$. (Note that indices such as $\hat{b}_4$ refer to the variable to which the element
corresponds, and not to the position in the vector $\hat{b}$.) Consequently the determination
of an initial b.f.s. $B^{(1)}$, $N^{(1)}$ and $x^{(1)}$ is a non-trivial task. Suitable methods
exist which incorporate the logic of the simplex method, and these are discussed in
Section 8.4.

It is a simple matter to discover whether a basic feasible solution is optimal.
Equation (8.2.1) is used to eliminate the basic variables from the objective function,
yielding a reduced objective function $f(x_N)$ of the nonbasic variables only, whose
coefficients give this information directly. For example in the LP problem (8.1.2)
with the choice $B = \{1, 2\}$, $N = \{3, 4\}$, (8.2.1) can be rearranged as shown in
(8.1.4) and substituted into the objective function in (8.1.2) to give

$$f = 1\tfrac{1}{2} + 2x_3 - x_4. \qquad (8.2.3)$$

Now at the b.f.s. $x_3$ and $x_4$ take the values $x_3 = x_4 = 0$, but in general may satisfy
$x_3 \ge 0$ and $x_4 \ge 0$. Thus only an increase in the value of any nonbasic variable is
allowed. Furthermore $f$ can be decreased from its value of $\hat{f} = 1\tfrac{1}{2}$ by increasing $x_4$,
since $x_4$ has a negative coefficient in (8.2.3), and so it follows that this b.f.s. is not
optimal. On the other hand, if the coefficients of the nonbasic variables in the
reduced objective function are all non-negative as in (8.1.6), then the corresponding
b.f.s. is optimal since there is no feasible change to the nonbasic variables which
will reduce $f(x_N)$.


In general terms the elimination of the basic variables uses the rearrangement of
(8.2.1) given by

$$x_B = A_B^{-1}(b - A_N x_N) = \hat{b} - A_B^{-1} A_N x_N. \qquad (8.2.4)$$

The reduced objective function can then be written as

$$\begin{aligned} f(x_N) &= c_B^T x_B + c_N^T x_N \\ &= c_B^T(\hat{b} - A_B^{-1} A_N x_N) + c_N^T x_N \\ &= \hat{f} + \hat{c}_N^T x_N, \end{aligned} \qquad (8.2.5)$$

say, where the coefficients $\hat{c}_N$ are defined by

$$\hat{c}_N = c_N - A_N^T \pi, \qquad (8.2.6)$$

and where

$$\pi = A_B^{-T} c_B. \qquad (8.2.7)$$

In the example above $\pi = (2, -1)^T$ and $\hat{c}_N = (2, -1)^T$. The coefficients $\hat{c}_N$ are
known as the reduced costs at the basic feasible solution. By virtue of the discussion
in the previous paragraph, the basic feasible solution is optimal if the reduced costs
satisfy the test

$$\hat{c}_N \ge 0. \qquad (8.2.8)$$

The first part of a simplex method iteration therefore is to determine the reduced
costs. If (8.2.8) holds then the basic feasible solution is optimal and the method
terminates. If (8.2.8) also contains some element $\hat{c}_i = 0$, then a similar argument
shows that $x_i$ can be increased with $f$ staying constant, so that the solution is
non-unique (assuming that a non-zero step is permitted, for example if $\hat{b} > 0$;
see next paragraph and Question 8.2).
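
The optimality test can be sketched directly from (8.2.6) and (8.2.7), here in exact arithmetic on the data of example (8.1.2). The function `reduced_costs` and the helper `solve` are illustrative names only, and column indices are 0-based, so variable $x_4$ is column 3.

```python
from fractions import Fraction as F

def solve(M, rhs):
    """Solve M x = rhs exactly by Gauss-Jordan elimination; None if singular."""
    n = len(M)
    T = [[F(v) for v in row] + [F(r)] for row, r in zip(M, rhs)]
    for col in range(n):
        piv = next((r for r in range(col, n) if T[r][col] != 0), None)
        if piv is None:
            return None
        T[col], T[piv] = T[piv], T[col]
        T[col] = [v / T[col][col] for v in T[col]]
        for r in range(n):
            if r != col:
                factor = T[r][col]
                T[r] = [vr - factor * vc for vr, vc in zip(T[r], T[col])]
    return [row[n] for row in T]

def reduced_costs(A, c, basis):
    """Return (pi, reduced costs keyed by nonbasic column index)."""
    m = len(A)
    A_B_T = [[A[i][j] for i in range(m)] for j in basis]   # rows of A_B^T
    pi = solve(A_B_T, [c[j] for j in basis])               # (8.2.7)
    nonbasic = [j for j in range(len(c)) if j not in basis]
    c_hat = {j: c[j] - sum(A[i][j] * pi[i] for i in range(m))
             for j in nonbasic}                            # (8.2.6)
    return pi, c_hat

A = [[1, 1, 1, 1], [1, 0, 1, -3]]
c = [1, 2, 3, 4]
pi, c_hat = reduced_costs(A, c, [0, 1])          # B = {1, 2}
optimal = all(v >= 0 for v in c_hat.values())    # test (8.2.8)
```

For $B = \{1, 2\}$ the test fails because the reduced cost of $x_4$ is negative; repeating the call with the basis `[0, 3]` gives non-negative reduced costs.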

Usually (8.2.8) is not satisfied, in which case the simplex method proceeds to
find a new basic feasible solution which has a lower value of the objective function.
Firstly a variable $x_q$, $q \in N$, is chosen for which $\hat{c}_q < 0$, and $f(x_N)$ is decreased by
increasing $x_q$ whilst the other nonbasic variables retain their zero value. Usually the
most negative $\hat{c}_q$ is chosen, that is

$$\hat{c}_q = \min_{i \in N} \hat{c}_i,$$

although other selections have been investigated (Goldfarb and Reid, 1977). As $x_q$
increases, the values of the basic variables $x_B$ change as indicated by (8.2.4), in
order to keep the system $Ax = b$ satisfied. Usually the need to keep $x_B \ge 0$ to
maintain feasibility limits the amount by which $x_q$ can be increased. In the above
example $\hat{c}_4$ is the only negative reduced cost, so $x_4$ is chosen to be increased, whilst
$x_3$ retains its zero value. The effect on $x_1$ and $x_2$ is indicated by (8.1.4), and $x_1$
increases like $\tfrac{1}{2} + 3x_4$ whilst $x_2$ decreases like $\tfrac{1}{2} - 4x_4$. Thus $x_2$ becomes zero
when $x_4$ reaches the value $\tfrac{1}{8}$, and no further increase in $x_4$ is permitted.

In general, because $x_q$ is the only nonbasic variable which changes, it follows
from (8.2.4) that the effect on $x_B$ is given by

$$x_B = \hat{b} - A_B^{-1} a_q x_q = \hat{b} + d\, x_q, \qquad (8.2.9)$$

say, where $a_q$ is the column of $A$ corresponding to variable $q$, and where

$$d = -A_B^{-1} a_q \qquad (8.2.10)$$

can be thought of as the derivative of the basic variables with respect to changes in
$x_q$. In particular if any $d_i$ has a negative value then an increase in $x_q$ causes a reduction
in the value of the basic variable $x_i$. From (8.2.9), $x_i$ becomes zero when
$x_q = \hat{b}_i / (-d_i)$. The amount by which $x_q$ can be increased is limited by the first basic
variable, $x_p$ say, to become zero, and the index $p$ is therefore that for which

$$\frac{\hat{b}_p}{-d_p} = \min_{\substack{i \in B \\ d_i < 0}} \frac{\hat{b}_i}{-d_i}. \qquad (8.2.11)$$

It may be, however, that there are no indices $i \in B$ such that $d_i < 0$. In this case
$f(x)$ can be decreased without limit, and this is the manner in which an unbounded
solution is indicated. In practical terms this usually implies a mistake in setting up
the problem in which some restriction on the variables has been omitted.
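
The ratio test (8.2.11) is short enough to sketch in full. The function name `ratio_test` is an assumption for illustration; it returns the position of the leaving variable within the basis together with the step in $x_q$, or `None` when no $d_i < 0$ exists and the problem is unbounded. The data are those of the example with $B = \{1, 2\}$ and $q = 4$.

```python
from fractions import Fraction as F

def ratio_test(b_hat, d):
    """Return (position p in the basis, step length) from (8.2.11),
    or None when no d_i < 0, which signals an unbounded problem."""
    candidates = [(b_i / -d_i, i)
                  for i, (b_i, d_i) in enumerate(zip(b_hat, d))
                  if d_i < 0]
    if not candidates:
        return None
    step, p = min(candidates)
    return p, step

b_hat = [F(1, 2), F(1, 2)]     # values of x1 and x2 at the b.f.s.
d = [F(3), F(-4)]              # d = -A_B^{-1} a_4 for the example
p, step = ratio_test(b_hat, d)
new_x_B = [b_i + d_i * step for b_i, d_i in zip(b_hat, d)]
```

Here the second basic variable ($x_2$) leaves the basis after a step of $\tfrac{1}{8}$ in $x_4$, at which point $x_1$ has risen to $\tfrac{7}{8}$.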

In geometric terms, the increase of $x_q$ and the corresponding change to $x_B$ in
(8.2.9) causes a move from the extreme point $x^{(k)}$ along an edge of the feasible
region. The termination of this move because $x_p$, $p \in B$, becomes zero indicates
that a new extreme point of the feasible region is reached. Correspondingly there
are again $n - m$ variables with zero value ($x_p$ and the $x_i$, $i \in N^{(k)}$, $i \ne q$), and these
are the nonbasic variables at the new b.f.s. Thus the new sets $N^{(k+1)}$ and $B^{(k+1)}$
are obtained by replacing $q$ by $p$ in $N^{(k)}$, and $p$ by $q$ in $B^{(k)}$, respectively. The form
of the iteration ensures that $\hat{b} \ge 0$ at the new b.f.s. and it is possible to show also
that $A_B$ is non-singular (see (8.2.15)), so that the conditions for a b.f.s. remain
satisfied. In the example above the nonbasic variable $x_4$ is increased and the basic
variable $x_2$ becomes zero. Thus 4 and 2 are interchanged between the sets $B = \{1, 2\}$
and $N = \{3, 4\}$ giving rise to a new basic feasible solution determined by $B = \{1, 4\}$
and $N = \{2, 3\}$. It is shown by virtue of (8.1.5) and (8.1.6) that the resulting b.f.s. is
in fact optimal.

With the determination of a new b.f.s., the description of an iteration of the 
simplex method is complete. The method repeats the sequence of calculations until 
an optimal b.f.s. is recognized. Usually each iteration reduces f(x), and since the 
number of vertices is finite, it follows that the iteration must terminate. There is 
one case, however, in which this may not be true, and this is when $\hat{b}_i = 0$ and
$d_i < 0$ occurs. Then no increase in $x_q$ is permitted, and although a new partitioning
$B$ and $N$ can be made as before, no decrease in $f(x)$ is made. In fact it is possible for
the algorithm to cycle by returning to a previous set $B$ and $N$ and so fail to terminate
even though a well-determined solution does exist. The possibility that $\hat{b}_i = 0$ is
known as degeneracy and the whole subject is discussed in more detail in Section
8.6. 

For small illustrative problems ($n = 2$ or 3) it is straightforward to follow the
above scheme of computation. That is given $B$ and $N$, then

$$A_B^{-1} = \frac{\operatorname{adjoint}(A_B)}{\det(A_B)} \qquad (8.2.12)$$

is calculated, and hence $\hat{b} = A_B^{-1} b$. Then (8.2.7) and (8.2.6) enable the reduced
costs $\hat{c}_N$ to be obtained. Either the algorithm terminates by (8.2.8) or the least $\hat{c}_i$
determines the index $q$. Computation of $d$ by (8.2.10) is followed by the test
(8.2.11) to determine $p$. Finally $p$ and $q$ are interchanged giving a new $B$ and $N$ with
which to repeat the process.

For larger problems some more efficient scheme is desirable which avoids the
computation of $A_B^{-1}$ from (8.2.12) on each iteration. The earliest method was the
tableau form of the simplex method. In this method the data is arranged in a
tableau:

$$\begin{array}{c|cc|c} & A_B & A_N & b \\ \hline f & c_B^T & c_N^T & 0 \end{array}$$

Then by making row operations on the tableau (adding or subtracting multiples of
one row (not the $f$ row) from another, or by scaling any row (not the $f$ row)), it is
possible to reduce the tableau to the form:

$$\begin{array}{c|cc|c} & I & \hat{A}_N & \hat{b} \\ \hline f & 0^T & \hat{c}_N^T & -\hat{f} \end{array} \qquad (8.2.13)$$

where $\hat{A}_N = A_B^{-1} A_N$, which represents the canonical or reduced form of the LP
problem. Essentially the process is that used in obtaining (8.1.4) and (8.2.3) from
(8.1.2): the operations on the $B$ rows are equivalent to premultiplying by $A_B^{-1}$, and
those on the $f$ row are equivalent to premultiplying by $c_B^T$ in the new $B$ rows and
subtracting this row from the $f$ row. In fact an initial canonical form (8.2.13) is
usually available directly as a result of the operations to find the initial b.f.s.
(Section 8.4). Tests (8.2.8) and (8.2.11) can be carried out directly on information
in the tableau (note that $d = -\hat{a}_q$) so that the indices $p$ and $q$ can be determined.
The new tableau is obtained by first scaling row $p$ so that the element in column
$q$ is unity. Multiples of row $p$ are then subtracted from other rows so that the
remaining elements in column $q$ become zero, and column $q$ becomes a column of
the unit matrix. The previous column ($p$) with this property becomes full during
these operations. Then columns $p$ and $q$ of (8.2.13) are interchanged, giving a new
tableau in canonical form. In fact this interchange need not take place, in which
case the columns of the unit matrix in (8.2.13) no longer occur in the first $m$
columns of the tableau. The iteration is then repeated on the new tableau. An
example of the tableau form applied to (8.1.2) is given in Table 8.2.1. In general
the quantity $\hat{a}_{pq}$ ($= -d_p$) must be non-zero (see (8.2.11)) and plays the role of
pivot in these row operations, in a similar way to the pivot in Gaussian elimination.
In Table 8.2.1 the pivot element is circled.
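
One tableau iteration amounts to a single pivot operation, which can be sketched as follows. The function name `pivot` is illustrative, and the layout is an assumption: the rows are those of the basic variables followed by the $f$ row, the last column is the right-hand side, and the $f$ row carries $-\hat{f}$ in that column. The starting tableau is the canonical form of (8.1.2) for $B = \{1, 2\}$.

```python
from fractions import Fraction as F

def pivot(tableau, p, q):
    """Scale row p so entry (p, q) is 1, then clear column q in other rows."""
    pivot_row = [v / tableau[p][q] for v in tableau[p]]
    new = []
    for r, row in enumerate(tableau):
        if r == p:
            new.append(pivot_row)
        else:
            new.append([v - row[q] * w for v, w in zip(row, pivot_row)])
    return new

# Columns: x1, x2, x3, x4, right-hand side. Rows: x1, x2, then f.
T1 = [[F(1), F(0), F(1), F(-3), F(1, 2)],
      [F(0), F(1), F(0), F(4), F(1, 2)],
      [F(0), F(0), F(2), F(-1), F(-3, 2)]]

# x4 enters (column index 3) and x2 leaves (row index 1); the pivot is 4.
T2 = pivot(T1, p=1, q=3)
```

After the pivot the $f$ row has no negative reduced costs, and the columns are left in place rather than interchanged, as the text allows.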

It became apparent that the tableau form is inefficient in that it updates the
whole tableau $\hat{A}_N$ at each iteration. The earlier part of this section shows that the
simplex method can be carried out with an explicit knowledge of $A_B^{-1}$, and it is
possible to carry out updating operations on this matrix. Since $A_B^{-1}$ is often smaller
than $\hat{A}_N$, the resulting method, known as the revised simplex method, is usually
more efficient. The effect of a basis change is to replace the column $a_p$ by $a_q$ in
$A_B^{(k)}$, which can be written as the rank one change

$$A_B^{(k+1)} = A_B + (a_q - a_p) e_p^T \qquad (8.2.14)$$

(suppressing superscript $(k)$). By using the Sherman-Morrison formula (Volume 1,
Question 3.13) it follows that

$$[A_B^{(k+1)}]^{-1} = A_B^{-1} - \frac{(d + e_p) v_p^T}{d_p} \qquad (8.2.15)$$

where $v_p^T$ is the row of $A_B^{-1}$ corresponding to $x_p$. The fact that $d_p \ne 0$ is implied by
(8.2.11) ensures that $A_B^{(k+1)}$ is non-singular. It is also possible to update the vectors
$\pi$ and $\hat{b}$ by using (8.2.15).

Table 8.2.1 The tableau form applied to problem (8.1.2)

                 x1     x2     x3     x4   |   rhs
Data:             1      1      1      1   |    1
                  1      0      1     -3   |   1/2
         f        1      2      3      4   |    0

k = 1:   x1       1      0      1     -3   |   1/2      (see (8.1.4))
         x2       0      1      0     (4)  |   1/2
         f        0      0      2     -1   |  -3/2      (see (8.2.3))

k = 2:   x1       1     3/4     1      0   |   7/8      (see (8.1.5))
         x4       0     1/4     0      1   |   1/8
         f        0     1/4     2      0   |  -11/8     (see (8.1.6))

The circled pivot element is shown in parentheses.
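
The rank one update (8.2.15) can be checked numerically on the example. In this sketch the name `update_inverse` is an assumption, `p` is the 0-based position of the leaving variable within the basis, and the inverse is held as a list of rows of exact fractions.

```python
from fractions import Fraction as F

def update_inverse(B_inv, d, p):
    """Apply (8.2.15): new inverse = B_inv - (d + e_p) v_p^T / d_p,
    where v_p^T is row p of B_inv and d = -B_inv a_q."""
    m = len(B_inv)
    v_p = B_inv[p]
    u = [d[i] + (1 if i == p else 0) for i in range(m)]   # d + e_p
    return [[B_inv[i][j] - u[i] * v_p[j] / d[p] for j in range(m)]
            for i in range(m)]

B_inv = [[F(0), F(1)], [F(1), F(-1)]]    # inverse of A_B for B = {1, 2}
a_q = [F(1), F(-3)]                      # column of A for x4
d = [-sum(B_inv[i][j] * a_q[j] for j in range(2)) for i in range(2)]
new_inv = update_inverse(B_inv, d, p=1)  # x2 leaves, x4 enters

# Check against the new basis matrix with columns a_1 and a_4.
new_A_B = [[F(1), F(1)], [F(1), F(-3)]]
product = [[sum(new_inv[i][k] * new_A_B[k][j] for k in range(2))
            for j in range(2)] for i in range(2)]
```

The update costs of order $m^2$ operations, against order $m^3$ for inverting the new basis matrix from scratch.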

A disadvantage of representing $A_B^{-1}$ in the revised simplex method is that it is
not adequate for solving large problems. Here the matrix $A$ is sparse, and the problem
is only manageable if this sparsity can be taken into account. Storing $A_B^{-1}$,
which is usually a full matrix, is no longer possible. Also it has more recently been
realized that representing $A_B^{-1}$ can sometimes cause numerical problems due to
magnification of round-off errors. Thus there is currently much research into other
representations of $A_B$ in the form of matrix factorizations which are easily invertible,
numerically stable, and which enable sparsity to be exploited. These developments
are described further in Section 8.5.


8.3 Other LP Techniques 


It has already been remarked that general LP problems can be reduced to the 
standard form (8.1.1) by adding extra variables. This section reviews other ways in
which a general LP problem can be tackled. One possibility is to use duality results
to transform the problem. These are dealt with in more detail in Section 9.5 where
three different LP duals of differing complexity are given. The resulting dual problem
is then solved by the simplex method, possibly after adding extra variables
to give a standard form. The aim in using a dual is to generate a transformed problem
which has a more favourable structure, in particular one with fewer rows in
the constraint matrix. Another type of transformation which can be used is known 
as decomposition in which a large structured problem is decomposed into a more 
simple master problem which is defined in terms of several smaller subproblems. 
The aim is to solve a sequence of smaller problems in such a way as to determine 
the solution of the original large problem. An example of this type of transform- 
ation is given in more detail in Section 8.5. 

A different technique, known as parametric programming, concerns finding the 
solution of an LP problem which is a perturbation of another problem for which 
the solution is available. This may occur, for example, when the effect of changes 
in the data (A, b, or c) is considered, or when an extra constraint or variable is 
added to the problem. The effect of changes in the data gives useful information 
regarding the sensitivity of various features in the model, and may often be more 
important than the solution itself. The rate of change of the values of the function 
and the basic variables with respect to b or c can be determined directly from 
$\hat{b} = A_B^{-1} b$ and $\hat{f} = c_B^T \hat{b}$ (see Question 8.14). To find the effect of small finite non-zero
changes, the aim is to retain as much information as possible from the unperturbed
solution when re-solving the problem. If only $b$ or $c$ is changed then a
representation of $A_B^{-1}$ is directly available. If $A_B$ is changed then the partition into
$B$ and $N$ can still be valuable. The only difficulty is that changes to $b$ or $A_B$ might
result in the condition $\hat{b} \ge 0$ being violated. It is important to be able to restore
feasibility without restarting ab initio, and this is discussed further in Section 8.4. 
Once feasibility is obtained then simplex steps are taken to achieve optimality. 
Another type of parametric programming arises when extra constraints are added 
(see Section 13.1 for example) which cut out the solution. The same approach of 
taking simplex steps to restore first feasibility and then optimality can be used. 
An alternative possibility is to use the dual formulation, when extra constraints in 
the primal just correspond to extra variables in the dual. The previous solution is 
then still dual feasible, and all that is required is to restore optimality. 

Another way of treating the general LP problem 


$$\begin{array}{lll} \underset{x}{\text{minimize}} & f(x) \triangleq c^T x & \\ \text{subject to} & a_i^T x = b_i, & i \in E \\ & a_i^T x \ge b_i, & i \in I \end{array} \qquad (8.3.1)$$

for finite index sets $E$ and $I$, is the active set method, which devolves from ideas
used in more general linear constraint problems (Chapters 10 and 11), and which is
perhaps more natural than the traditional approach via (8.1.1). Similar introductory
considerations apply as in Section 8.1, and in particular the solution usually occurs
at an extreme point or vertex of the feasible region. (When this situation does not
hold, for example in min $x_1 + x_2$ subject to $x_1 + x_2 = 1$, then it is possible to create
a vertex by removing undetermined variables, or equivalently by having pseudo-constraints
$x_i = 0$ in the active set; see Section 8.4.) Each iterate $x^{(k)}$ in the method is
therefore a feasible vertex defined by the equations

$$a_i^T x = b_i, \qquad i \in \mathcal{A} \qquad (8.3.2)$$

where $\mathcal{A}$ is the active set of $n$ constraint indices, and where the vectors $a_i$ are
independent. Except in degenerate cases (see Section 8.6), $\mathcal{A}$ is the set of active
constraints $\mathcal{A}(x^{(k)})$ in (7.1.2).

Each iteration in the active set method consists of a move from one vertex to
another along a common edge. At $x^{(k)}$ the Lagrange multiplier vector

$$\lambda = A^{-1} \nabla f = A^{-1} c \qquad (8.3.3)$$

is evaluated, where $A$ now denotes the $n \times n$ matrix with columns $a_i$, $i \in \mathcal{A}$. Let the
columns of $A^{-T}$ be written $s_1, s_2, \ldots, s_n$. Then the construction (9.1.14) shows
that the vectors $s_i$, $i \in \mathcal{A} \cap I$, are the directions of all feasible edges at $x^{(k)}$, and the
multipliers $\lambda_i$ are the slopes of $f(x)$ along these edges. Thus if

$$\lambda_i \ge 0, \qquad i \in \mathcal{A} \cap I \qquad (8.3.4)$$

then no feasible descent directions exist, so $x^{(k)}$ is optimal and the iteration terminates.
Otherwise the most negative $\lambda_i$, $i \in \mathcal{A} \cap I$ ($\lambda_q$ say) is chosen, and a search is
made along the downhill edge

$$x = x^{(k)} + \alpha s_q, \qquad \alpha \ge 0. \qquad (8.3.5)$$


The $i$th constraint has the value $c_i^{(k)} = a_i^T x^{(k)} - b_i$ at $x^{(k)}$ and the value $c_i(x) = c_i^{(k)} + \alpha a_i^T s_q$
at points along the edge. The search is terminated by the first inactive constraint
($p$ say) to become active. Candidates for $p$ are therefore indices $i \notin \mathcal{A}$ with
$a_i^T s_q < 0$ and any such constraint function becomes zero when

$$\alpha = \frac{b_i - a_i^T x^{(k)}}{a_i^T s_q}.$$

Thus the index $p$ and the corresponding value of $\alpha$ are defined by

$$\alpha = \min_{\substack{i:\, i \notin \mathcal{A} \\ a_i^T s_q < 0}} \frac{b_i - a_i^T x^{(k)}}{a_i^T s_q}. \qquad (8.3.6)$$

The corresponding point defined by (8.3.5) is the new vertex $x^{(k+1)}$ and the new
active set is obtained by replacing $q$ by $p$ in $\mathcal{A}$. The iteration is then repeated from
this new vertex until termination occurs. Only one column of $A$ is changed at each
iteration, so $A^{-1}$ can be readily updated as in (8.2.15).
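
The iteration (8.3.3) to (8.3.6) can be sketched for a problem with inequality constraints only, so that every active index is a candidate for $q$. This is a minimal illustration under that assumption: the names `active_set_lp` and `solve` and the test problem are hypothetical, the linear systems are re-solved at each vertex rather than updated by (8.2.15), and degeneracy is not handled.

```python
from fractions import Fraction as F

def solve(M, rhs):
    """Solve M x = rhs exactly by Gauss-Jordan elimination; None if singular."""
    n = len(M)
    T = [[F(v) for v in row] + [F(r)] for row, r in zip(M, rhs)]
    for col in range(n):
        piv = next((r for r in range(col, n) if T[r][col] != 0), None)
        if piv is None:
            return None
        T[col], T[piv] = T[piv], T[col]
        T[col] = [v / T[col][col] for v in T[col]]
        for r in range(n):
            if r != col:
                factor = T[r][col]
                T[r] = [vr - factor * vc for vr, vc in zip(T[r], T[col])]
    return [row[n] for row in T]

def active_set_lp(c, a, b, x, active):
    """Vertex-to-vertex iteration for min c^T x s.t. a_i^T x >= b_i."""
    n = len(c)
    dot = lambda u, v: sum(ui * vi for ui, vi in zip(u, v))
    while True:
        normals = [a[i] for i in active]                     # rows of A^T
        A = [[row[r] for row in normals] for r in range(n)]  # columns a_i
        lam = solve(A, c)                                    # (8.3.3)
        q = min(range(n), key=lambda k: lam[k])
        if lam[q] >= 0:                                      # (8.3.4)
            return x, active, lam
        e_q = [1 if k == q else 0 for k in range(n)]
        s = solve(normals, e_q)                              # column q of A^{-T}
        steps = [((b[i] - dot(a[i], x)) / dot(a[i], s), i)
                 for i in range(len(a))
                 if i not in active and dot(a[i], s) < 0]    # (8.3.6)
        if not steps:
            raise ValueError("problem is unbounded")
        alpha, p = min(steps)
        x = [xi + alpha * si for xi, si in zip(x, s)]        # (8.3.5)
        active = active[:q] + [p] + active[q + 1:]

# Hypothetical problem: min -x1 - 2*x2 with x1 >= 0, x2 >= 0,
# x1 + x2 <= 4 and x2 <= 3, each written in the form a_i^T x >= b_i.
c = [-1, -2]
a = [[1, 0], [0, 1], [-1, -1], [0, -1]]
b = [0, 0, -4, -3]
x_opt, final_active, lam = active_set_lp(c, a, b, [0, 0], [0, 1])
```

Starting from the origin the iterates visit the vertices $(0, 0)$, $(0, 3)$ and $(1, 3)$, where both multipliers are non-negative and the iteration stops.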

In fact the active set method is closely related to the simplex method, when the 
slack variables are added to (8.3.1). It is not difficult (see Question 8.17) to show 
that the Lagrange multiplier vector in (8.3.3) is the reduced costs vector in (8.2.6), 
so that the optimality tests (8.3.4) and (8.2.8) are identical. Likewise the index q 
which is determined at a non-optimal solution is the same in both cases. Moreover
the tests (8.3.6) and (8.2.11) which determine $p$ are also equivalent, so the same
interchanges in active constraints (nonbasic slack variables) are made. In fact the
only difference between the two methods lies in how the inverse matrix information
is represented, because the active set method updates the $n \times n$ matrix $A^{-1}$
whereas the revised simplex method updates the $m \times m$ matrix $A_B^{-1}$. Thus for problems
in standard form (8.1.1) which only have a few equations ($m \ll n$), a smaller
matrix is updated by the revised simplex method, which is therefore preferable.
On the other hand, if the problem is like (8.3.1) and there are $m \gg n$ constraints
(mostly inequalities), then the simplex method requires many slack variables to be
added, and the active set method updates the smaller inverse matrix and is therefore
preferable.

In both cases the inefficiency of the worse method is caused by the fact that 
columns of the unit matrix occur in the matrix ($A$ or $A_B$) whose inverse is being
updated. For example if bounds $x \geq 0$ (or more generally $l \leq x \leq u$) are recognized
in the active set method, then the matrix $A$ can be partitioned (after a suitable
permutation of variables) into the form

$$A = \begin{bmatrix} I & A_1 \\ 0 & A_2 \end{bmatrix} \begin{matrix} p \\ n-p \end{matrix}$$

where $p$ is the number of active bounds. Then only $A_2^{-1}$ need be recurred, and $A^{-1}$
can be recovered from the expression

$$A^{-1} = \begin{bmatrix} I & -A_1 A_2^{-1} \\ 0 & A_2^{-1} \end{bmatrix}.$$

In this way the active set method becomes as efficient as possible, if sparsity in the 
remaining constraint normals can be ignored. However a similar modification to the 
revised simplex method can be made when basic slack variables are present. Then 
$A_B$ also contains columns of the unit matrix, which enables a smaller submatrix to
be recurred. If this is done then the methods are entirely equivalent. However as a 
personal choice I prefer the description of linear programming afforded by the 
active set method, since it is more natural and direct, and does not rely on intro- 
ducing extra variables to solve general LP problems. 


8.4 Feasible Points for Linear Constraints 


In this section the method of artificial variables is described for finding an initial 
basic feasible solution for use in the simplex method. The underlying idea is of wide 
generality and it is shown that it is possible to determine a similar but more flexible 
and efficient method for use in a wide variety of situations. These include the active 
set methods for minimization subject to linear constraints described in Sections 8.3 
(linear objective function), 10.3 (quadratic objective), and 11.2 (general objective). 
The resulting methods are also very suitable for restoring feasibility when doing 
parametric programming. The idea of an $L_1$ exact penalty function for nonlinear
constraints (Section 14.3) is also seen to be a generalization of these ideas.

In the first instance the problem is considered of finding a basic feasible solution
to the constraints

$$Ax = b, \qquad x \geq 0 \tag{8.4.1}$$


which arise in the standard form of linear programming (8.1.1). Extra variables r 
(often referred to as artificial variables) are introduced into the problem, such that 


$$r = b - Ax. \tag{8.4.2}$$


These variables can be interpreted as residuals of the equations $Ax = b$. In the
method of artificial variables it is first ensured that $b \geq 0$ by reversing the sign
of any row of $A$ and $b$ for which $b_i < 0$. Then the auxiliary problem

$$\begin{aligned}
&\underset{x,\,r}{\text{minimize}} && \sum_i r_i \\
&\text{subject to} && Ax + r = b, \quad x \geq 0, \quad r \geq 0
\end{aligned} \tag{8.4.3}$$


is solved in an attempt to reduce the residuals to zero. Let $x'$ and $r'$ solve (8.4.3);
clearly if $r' = 0$ then $x'$ is feasible in (8.4.1) as required, whereas if $r' \neq 0$ then there
is no feasible point for (8.4.1). (This latter observation is true since if $x$ is feasible
in (8.4.1) then $x, 0$ is a feasible point of (8.4.3) with $\sum_i r_i = 0$, which contradicts the
optimality of $r' \neq 0$.) Also (8.4.3) is a linear program in standard form with
variables $x, r$ for which the coefficient matrix is $[A : I]$ and the costs are $0, e$ where
$e$ is a vector of ones. An initial basic feasible solution is given by having the $r$
variables basic and the $x$ variables nonbasic so that $r^{(1)} = b$ and $x^{(1)} = 0$. The initial
basis matrix is simply $A_B = I$. Thus the simplex method itself can be used to solve
(8.4.3) directly from this b.f.s.

A feature of the calculation is that if the artificial variable $r_i$ becomes nonbasic
then the constraint $(Ax - b)_i = 0$ becomes satisfied and need never be relaxed, that
is $r_i$ need never become basic again. Thus once an artificial variable becomes
nonbasic, it is removed from the computation entirely, and some effort may be saved.
If a feasible point exists ($r' = 0$) the optimum basic feasible solution to (8.4.3) has
no basic $r_i$ variables (assuming no degeneracy) so that all the artificial variables must
have been removed. Therefore the remaining variables $x^T = (x_B^T, x_N^T)$ are partitioned
to give a basic feasible solution to (8.4.1). The same tableau $\hat{A}$ or inverse basis
matrix $A_B^{-1}$ is directly available to be used in the solution of the main LP (8.1.1).
Because the solution of (8.4.3) is a preliminary to the solution of (8.1.1), the two
are sometimes referred to as phase I and phase II of the simplex method respectively.
In some degenerate cases (for example when the constraint matrix $A$ is rank
deficient) it is possible that some artificial variables may remain in the basis after
(8.4.3) is solved. Since $r = 0$ nothing is lost by going on to solve the main problem
with the artificial variables remaining in the basis, essentially just as a means of
enabling the basic variables to be eliminated (see Question 8.21).

An example of the method to find a basic feasible solution to the constraints in 
(8.1.2) is described. Since $b \geq 0$ no sign change is required. Artificial variables $r_1$
and $r_2$ ($x_5$ and $x_6$ say) are added to give the phase I problem

$$\begin{aligned}
&\underset{x}{\text{minimize}} && x_5 + x_6 \\
&\text{subject to} && x_1 + x_2 + x_3 + x_4 + x_5 = 1 \\
& && x_1 + x_3 - 3x_4 + x_6 = \tfrac{1}{2} \\
& && x \geq 0.
\end{aligned}$$

A basic feasible solution for this problem is $B = \{5, 6\}$, $N = \{1, 2, 3, 4\}$, with $A_B = I$
by construction. A calculation shows that $\hat{c}_N = (-2, -1, -2, 2)^T$ so that $x_1$ ($q = 1$)
is chosen to be increased. Then $d = (-1, -1)^T$ so both $x_5$ and $x_6$ are reduced, but
$x_6$ reaches zero first. Thus $x_1$ is made basic and $x_6$ nonbasic. Since $x_6$ is artificial
it is removed from the computation, and a smaller problem is solved with $B = \{1, 5\}$,
$N = \{2, 3, 4\}$. In this case the calculation shows that $x_4$ is increased and $x_5$ goes to
zero. When $x_4$ is made basic and $x_5$ removed, there remains the partition $B = \{1, 4\}$,
$N = \{2, 3\}$. Both the artificial variables have been removed and so phase I is complete.
This basic feasible solution is carried forward into phase II and by chance is
found to be optimal (see Section 8.2), so no further basis changes are made.
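
The first iteration of this example can be checked with a short Python sketch. It is not part of the original text: the variable names are chosen here for illustration, indices are 0-based (so `4` and `5` stand for $x_5$ and $x_6$), and the sketch relies on the initial basis matrix being the unit matrix, so no linear system has to be solved.

```python
from fractions import Fraction

# Constraint data of (8.1.2): x1 + x2 + x3 + x4 = 1, x1 + x3 - 3*x4 = 1/2.
A = [[1, 1, 1, 1],
     [1, 0, 1, -3]]
b = [Fraction(1), Fraction(1, 2)]
m, n = len(A), len(A[0])

# Phase I data for (8.4.3): coefficient matrix [A : I] and costs (0, e).
A1 = [row + [1 if i == j else 0 for j in range(m)] for i, row in enumerate(A)]
cost = [0] * n + [1] * m

basic = [n + i for i in range(m)]      # artificial variables x5, x6
nonbasic = list(range(n))
x_B = list(b)                          # A_B = I, so the basic values equal b

# Because A_B = I, the multipliers are simply c_B, and the reduced cost of a
# nonbasic column a_j is c_j - a_j^T c_B.
pi = [cost[j] for j in basic]
c_hat = [cost[j] - sum(A1[i][j] * pi[i] for i in range(m)) for j in nonbasic]

# Most negative reduced cost (ties resolved by taking the first index).
q = nonbasic[min(range(len(c_hat)), key=lambda k: c_hat[k])]
d = [-A1[i][q] for i in range(m)]      # d = -A_B^{-1} a_q with A_B = I

# Ratio test: the basic variable that reaches zero first leaves the basis.
alpha, p = min((x_B[i] / -d[i], i) for i in range(m) if d[i] < 0)
print(c_hat, q + 1, d, basic[p] + 1, alpha)
```

The values printed are the reduced costs, the entering index, the vector $d$, the leaving index and the step, in the 1-based numbering of the text.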

The way in which an artificial variable is introduced is very similar to that in 
which a slack variable is introduced (see Sections 8.1 and 7.2). In fact if a slack 
variable has already been added into an inequality $a_i^T x \leq b_i$, and if $b_i \geq 0$, then it
is possible to use a slack variable in the initial basis and not add an artificial variable
as well. For example consider the system

$$\begin{aligned}
x_1 + x_2 + x_3 + x_4 &\leq 1 \\
x_1 + x_3 - 3x_4 &= \tfrac{1}{2}, \qquad x \geq 0.
\end{aligned}$$

Adding a slack variable $x_5$ gives

$$\begin{aligned}
x_1 + x_2 + x_3 + x_4 + x_5 &= 1 \\
x_1 + x_3 - 3x_4 &= \tfrac{1}{2}, \qquad x \geq 0
\end{aligned} \tag{8.4.4}$$

which is in standard form. Since $b_1 = 1 \geq 0$ the slack variable $x_5$ can be used in the
initial basis, and so an artificial variable ($x_6$ say) need only be added into the second
equation. The resulting cost function is the sum of the artificial variables only, so in
this case the phase I problem is

$$\begin{aligned}
&\text{minimize} && x_6 \\
&\text{subject to} && x_1 + x_2 + x_3 + x_4 + x_5 = 1 \\
& && x_1 + x_3 - 3x_4 + x_6 = \tfrac{1}{2}, \qquad x \geq 0
\end{aligned}$$

and the initial partition is $B = \{5, 6\}$, $N = \{1, 2, 3, 4\}$. As before artificial variables
are removed from the problem when they become nonbasic, but slack variables
remain in the problem throughout. In this example it happens that $x_6$ is removed
on the first iteration and the resulting partition $B = \{1, 5\}$, $N = \{2, 3, 4\}$ gives a
basic feasible solution for (8.4.4), to be used in phase II. In general if $b_i < 0$ after
adding a slack variable to an inequality $a_i^T x \leq b_i$, then it is also necessary to add an
artificial variable to that equation. This is cumbersome and can be avoided by adopting
the more general framework which follows, and which does not require any sign
changes to ensure $b \geq 0$.

At the expense of a small addition to the logic in phase I, a much more flexible
technique for obtaining feasibility can be developed (Wolfe, 1965). This is to allow
negative variables in the basis and to extend the cost function to give the auxiliary
problem on the $k$th step as

$$\begin{aligned}
&\text{minimize} && \sum_{i:\,x_i^{(k)} < 0} -x_i \; + \sum_{i:\,r_i^{(k)} < 0} -r_i \; + \sum_{i:\,r_i^{(k)} > 0} r_i \\
&\text{subject to} && Ax + r = b, \qquad x_i \geq 0 \quad \forall\, i : x_i^{(k)} \geq 0.
\end{aligned} \tag{8.4.5}$$

In this problem it is understood that an element of the cost vector is taken as $-1$ if
$x_i^{(k)} < 0$ and $0$ if $x_i^{(k)} \geq 0$, and likewise as $-1$ if $r_i^{(k)} < 0$ and $+1$ if $r_i^{(k)} > 0$, so that
the costs may change from one iteration to another. Any set of basic variables can
be chosen initially (assuming $A_B$ is non-singular) without the need to have $\hat{b} \geq 0$,
and simplex steps are made as before, except that (8.2.11) must be modified in an
obvious way to allow for a negative variable being increased to zero. Once $x_i$ becomes
non-negative then the condition $x_i \geq 0$ is subsequently enforced and artificial
variables are removed once they become zero. There is no need with (8.4.5)
to have both artificial and slack variables in any one equation. An example is given
in Question 8.22. In fact, because any set of basic variables can be chosen, the
artificial variable is added only as a means to enable an easily invertible matrix $A_B$ to
be chosen. Thus if such a matrix is already available then artificial variables are not
required at all.

Another consequent advantage of (8.4.5) occurs in parametric programming 
(Section 8.3) when the solution from a previous problem ($B$ and $A_B^{-1}$) is used to
start phase I for the new problem. It is likely that the old basis is close to a feasible
basis if the changes are small. However it is quite possible that the perturbation
causes feasibility ($\hat{b} \geq 0$) to be lost, and it is much more efficient to use (8.4.5) in
conjunction with the old basis, rather than starting the phase I procedure with a
basis of artificial variables. Also a better feasible point is likely to be obtained. For
example in (8.4.4) if $b_2$ is perturbed to $-\tfrac{1}{2}$ and the same partition $B = \{1, 5\}$,
$N = \{2, 3, 4\}$ is used, then $\hat{b} = (-\tfrac{1}{2}, \tfrac{3}{2})^T$ can be calculated. Thus the cost function
in (8.4.5) is $-x_1$ and it follows that $\hat{c}_N = (0, 1, -3)^T$ so that $x_4$ is chosen to be
increased. Then $d = (3, -4)^T$ so the increase in $x_4$ causes $x_1$ to increase towards
zero and $x_5$ to decrease towards zero. $x_1$ reaches zero first when $x_4 = \tfrac{1}{6}$, so the new
partition is $B = \{4, 5\}$, $N = \{1, 2, 3\}$ with $\hat{b} = (\tfrac{1}{6}, \tfrac{5}{6})^T$. Since there are now no
negative variables, feasibility is restored and phase II can commence.
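
The numbers in this example can be reproduced by the following Python sketch of one step with the composite cost of (8.4.5). The sketch is an illustration added here, not the author's code: the helper `solve`, the variable names and the 0-based indexing are assumptions, only $x$ variables occur (so the cost elements are $-1$ or $0$), and the step stops at the first breakpoint rather than passing through several vertices.

```python
from fractions import Fraction

def solve(M, rhs):
    """Solve M z = rhs exactly by Gauss-Jordan elimination with row pivoting."""
    n = len(M)
    T = [[Fraction(v) for v in row] + [Fraction(r)] for row, r in zip(M, rhs)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(T[r][col]))
        T[col], T[piv] = T[piv], T[col]
        T[col] = [v / T[col][col] for v in T[col]]
        for r in range(n):
            if r != col and T[r][col] != 0:
                f = T[r][col]
                T[r] = [vr - f * vc for vr, vc in zip(T[r], T[col])]
    return [T[r][n] for r in range(n)]

# System (8.4.4) with b2 perturbed from 1/2 to -1/2.
A = [[1, 1, 1, 1, 1],
     [1, 0, 1, -3, 0]]
b = [Fraction(1), Fraction(-1, 2)]
basic, nonbasic = [0, 4], [1, 2, 3]        # x1, x5 basic; x2, x3, x4 nonbasic

def column(j):
    return [row[j] for row in A]

A_B = [[A[i][j] for j in basic] for i in range(2)]
A_B_T = [[A[i][j] for i in range(2)] for j in basic]

b_hat = solve(A_B, b)                      # current basic values
c_B = [-1 if v < 0 else 0 for v in b_hat]  # composite costs of (8.4.5)
pi = solve(A_B_T, c_B)
c_hat = [-sum(a * p for a, p in zip(column(j), pi)) for j in nonbasic]
q = nonbasic[min(range(len(c_hat)), key=lambda k: c_hat[k])]
d = [-v for v in solve(A_B, column(q))]

# Modified ratio test: a negative basic variable may rise to zero, and a
# non-negative one may fall to zero; either event ends the step.
steps = [(-v / di, i) for i, (v, di) in enumerate(zip(b_hat, d))
         if (v < 0 and di > 0) or (v >= 0 and di < 0)]
alpha, p = min(steps)

new_values = [v + alpha * di for v, di in zip(b_hat, d)]
new_values[p] = alpha                      # entering variable takes slot p
basic[p] = q
print(b_hat, c_hat, d, alpha, basic, new_values)
```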

Use of the auxiliary problem (8.4.5) is closely related to a technique for solving 
linear $L_1$ approximation problems in which $\|r\|_1$ ($= \sum_i |r_i|$) is minimized, where
$r$ is defined by (8.4.2). The $|r_i|$ terms are handled by introducing positive and
negative variables $r_i^+$ and $r_i^-$, where $r_i^+ \geq 0$, $r_i^- \geq 0$, and by replacing $r_i = r_i^+ - r_i^-$
and $|r_i| = r_i^+ + r_i^-$ as described in Section 7.2. Barrodale and Roberts (1973)
suggest a modification for improving the efficiency of the simplex method so that
fewer iterations are required to solve this particular type of problem. The idea is to
allow the simplex iteration to pass through a number of vertices at one time so as
to minimize the total cost function. The same type of modification can be used
advantageously with (8.4.5). Let the nonbasic variable $x_q$ (either an element of
$x$ or $r$) for which $\hat{c}_q < 0$ be chosen to be increased as in Section 8.2. If a variable
$x_i < 0$ is increased to zero, then the corresponding element of $c_B$ is changed from
$-1$ to $0$ and $\hat{c}_q$ is easily re-computed. If $\hat{c}_q < 0$ still holds then $x_q$ can continue to
be increased without making a pivot step to change $A_B^{-1}$ or $\hat{A}$. Only if $\hat{c}_q \geq 0$ is a
pivot step made. Likewise if a variable $r_i < 0$ (or $> 0$) becomes zero then the
corresponding element of $c_B$ is changed from $-1$ to $+1$ (or $+1$ to $-1$) and $\hat{c}_q$ is
re-computed and tested in the same way. In this modification an artificial variable
need not be removed immediately it becomes zero and $x$ variables may be allowed
to go negative if the cost function is reduced. For best efficiency it is valuable to
scale the rows of $Ax = b$, or equivalently to scale the elements of the cost function in
(8.4.5) to balance the contribution from each term (to the same order of magnitude).

It is possible to suggest an entirely equivalent technique for calculating a feasible
point for use with the active set method for linear programming (Section 8.3), or
with other active set methods for linear constraints (Sections 10.3 and 11.2). The
technique is described here in the case that $E$ is empty, although it is readily modified
to the more general case. The phase I auxiliary problem (Fletcher, 1970) which
is solved at $x^{(k)}$ is

$$\begin{aligned}
&\underset{x}{\text{minimize}} && \sum_{i \in V^{(k)}} (b_i - a_i^T x) \\
&\text{subject to} && a_i^T x \geq b_i, \quad i \notin V^{(k)}
\end{aligned} \tag{8.4.6}$$

where $V^{(k)} = V(x^{(k)})$ is the set of violated (that is infeasible) constraints at $x^{(k)}$.
Thus the cost function in (8.4.6) is the sum of moduli of violated constraints. The
iteration commences by making any convenient arbitrary choice of an active set
of $n$ constraints, and this determines an initial vertex $x^{(1)}$. At $x^{(k)}$ the gradient of
the cost function is $-\sum_{i \in V^{(k)}} a_i$, and the method proceeds as in (8.3.3) to use this
vector to calculate multipliers and hence to determine an edge $s_q$ along which the
cost function has negative slope. A search along $x^{(k)} + \alpha s_q$ is made to minimize the
sum of constraint violations $\sum_{i \in V(x)} (b_i - a_i^T x)$. This search is always terminated
by an inactive constraint ($p$ say) becoming active, and $p$ then replaces $q$ in the
active set. The method terminates when $x^{(k)}$ is found to be a feasible point, and the
resulting active set and the matrix $A^{-1}$ in (8.3.3) are passed forward to the phase II
problem (for example (8.3.1)).
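
As a small illustration of the quantities in (8.4.6), the following Python sketch evaluates the violated set, the sum of violations and its gradient at a given point. The function name `phase1_cost` and the test data are hypothetical and are not taken from the text; the constraints are assumed to be written as $a_i^T x \geq b_i$ with no equalities.

```python
def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def phase1_cost(a, b, x):
    """Return (V, cost, gradient) for the phase I problem (8.4.6) at x.

    V is the list of violated constraints a[i].x < b[i], cost is the sum of
    the violations b[i] - a[i].x, and the gradient is minus the sum of the
    violated constraint normals.
    """
    V = [i for i in range(len(a)) if dot(a[i], x) < b[i]]
    cost = sum(b[i] - dot(a[i], x) for i in V)
    grad = [-sum(a[i][r] for i in V) for r in range(len(x))]
    return V, cost, grad

# Hypothetical constraints: x1 >= 0, x2 >= 0, x1 + 2*x2 <= 4, x1 <= 3.
a = [(1, 0), (0, 1), (-1, -2), (-1, 0)]
b = [0, 0, -4, -3]
V, cost, grad = phase1_cost(a, b, [5, -1])
print(V, cost, grad)
```

At the point $(5, -1)$ the second and fourth constraints are violated, so only their normals contribute to the gradient.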

To create the initial vertex $x^{(1)}$ easily in the absence of any parametric programming
information, it is also possible to have something similar to the artificial variable
method. In fact artificial constraints or pseudoconstraints $x_i = 0$, $i = 1, 2, \ldots, n$,
are added to the problem, if they are not already present, and form the initial active
set (Fletcher and Jackson, 1974). Pseudoconstraints with a non-zero Lagrange
multiplier are always eligible for removal from the active set, and they do not contribute
to the cost function in the line search. The value of this device is that the initial
matrix used in (8.3.3), etc., is $A = I$, which is immediately available. Once a
pseudoconstraint becomes inactive it is removed from the problem; this usually ensures
that the pseudoconstraints are rapidly removed from the problem in favour of
actual constraints. However pseudoconstraints with a zero multiplier are allowed to
remain in the active set at a solution. This can be advantageous in phase I when the
feasible region has no true vertex, other than at large values of $x$ which might cause
round-off difficulties. In phase II it also solves the problem of eliminating redundant
variables (for example as in min $x_1 + x_2$ subject to $x_1 + x_2 = 1$) in a systematic way.

One final idea related to feasible points is worthy of mention. It is applicable to 
both a standard form problem (8.1.1) or to the more general form (8.3.1) used in 
the active set method. The feasible point which is found by the methods described 
in this section is essentially arbitrary, and may be far from the solution. It can be 
advantageous to bias the phase I cost function in (8.4.5) or (8.4.6) by adding to it
a multiple $\nu f(x)$ of the phase II cost function. For all $\nu$ sufficiently small the solution
of phase I is also that of phase II so the possibility exists of combining both
these stages into one. The limit $\nu \to 0$ corresponds to the phase I-phase II iteration
above, and is sometimes referred to (after multiplying through by $M = 1/\nu \gg 0$) as
the big M method. However the method is usually more efficient if larger values of
$\nu$ are taken, although if $\nu$ is too large then a feasible point is not obtained. In fact it
is not difficult to see that the method has become one for minimizing an $L_1$ exact
penalty function for the LP problem (8.3.1), as described in Section 14.3. Thus the
result of theorem 14.3.1 indicates that the bound $\nu < 1/\|\lambda^*\|_\infty$ limits the choice of
$\nu$, where $\lambda^*$ is the optimum Lagrange multiplier vector for (8.3.1).

This type of idea is of particular advantage for parametric programming in QP
problems (Chapter 10), because the active set $\mathcal{A}$ from a previously solved QP problem
may have less than $n$ elements and so cannot be used as the initial vertex in the
phase I problem (8.4.6). However if $\nu f(x)$ (now a quadratic function) is added to
the phase I cost function in (8.4.6), then a QP-like problem results and the previous 
active set and the associated inverse information can be carried forward and used 
directly as initial data for this problem. No doubt a similar idea could also be used 
in more general linear constraint problems (Chapter 11), although I do not think 
that this possibility has been investigated. 


8.5 Stable and Large-scale Linear Programming 


A motivation for considering alternatives to the tableau method or the revised 
simplex method is that the direct representation of $A_B^{-1}$ or its implicit occurrence
in the matrix $\hat{A} = A_B^{-1} A_N$ can lead to difficulties over round-off error. If an
intermediate b.f.s. is such that $A_B$ is nearly singular then $A_B^{-1}$ becomes large and
representing this matrix directly introduces large round-off errors. If on a later iteration
$A_B$ is well conditioned then $A_B^{-1}$ is not large but the previous large round-off errors
remain and their relative effect is large. This problem is avoided if a stable factorization
of $A_B$ is used which does not represent the inverses of potentially nearly
singular matrices. Typical examples of this for solving equations Ax = b are the LU 
factors (with pivoting) or the QR factors (L is lower triangular, U, R are upper tri- 
angular, and Q is orthogonal). Some generalizations of these ideas in linear program- 
ming are described in this section. 

Also in this section, methods are discussed which are suitable for solving large 
sparse LP problems because they have some features in common. In terms of (8.1.1)
this usually refers to problems in which the number of equations $m$ is greater than a
few hundred, and in which the coefficient matrix $A$ is sparse, that is it contains a
high proportion of zero elements. The standard techniques of Section 8.2 (or 8.3)
fail because the matrices such as $A_B^{-1}$ or $\hat{A}$ do not retain the sparsity of $A$ and
must be stored in full, which is usually beyond the capacity of the rapid access
computer store. By taking the sparsity structure of $A$ into account it is possible to
solve large sparse problems. There are two main types of method, one of which is a
general sparse matrix method which allows for any sparsity structure and attempts
to update an invertible representation of $A_B$ which maintains a large proportion of
the sparsity. The other type is the decomposition method which attempts to reduce 
the problem to a number of much smaller LP problems which can be solved con- 
ventionally. Decomposition methods usually require a particular structure in the 
matrix A so are not of general applicability. One practical example of this type of 
method is given, and other possibilities are referenced. It is not possible to say that 
any one method for large-scale LP is uniformly best because comparisons are 
strongly dependent on problem size and structure, but my interpretation of current 
thinking is that even those problems which are suitable for solving by decomposition 
methods can also be handled efficiently by general sparse matrix methods, so a good 
implementation of the latter type of method is of most importance. 

The earliest attempt to take sparsity into account is the product form method. 
Equation (8.2.15) for updating the matrix $A_B^{-1}$ can be written

$$[A_B^{(k+1)}]^{-1} = M_k [A_B^{(k)}]^{-1} \tag{8.5.1}$$

where $M_k$ is the matrix

$$M_k = \begin{bmatrix}
1 & & -d_1/d_p & & \\
 & \ddots & \vdots & & \\
 & & -1/d_p & & \\
 & & \vdots & \ddots & \\
 & & -d_m/d_p & & 1
\end{bmatrix} \tag{8.5.2}$$

whose non-trivial column is column $p$, and where $d$ is defined in (8.2.10). Thus if
$A_B^{(1)} = I$, then $[A_B^{(k)}]^{-1}$ can be represented by the product

$$[A_B^{(k)}]^{-1} = M_{k-1} \cdots M_2 M_1. \tag{8.5.3}$$

In the product form method the non-trivial column of each $M_j$, $j = 1, 2, \ldots, k-1$,
is stored in a computer file in packed form, that is just the non-zero elements and
their row index. These vectors were known as eta vectors in early implementations.
Then the representation (8.5.3) is used whenever operations with $A_B^{-1}$ are required.
As $k$ increases the file of eta vectors grows longer and operating with it becomes
more expensive. Thus ultimately it becomes cheaper to reinvert the basis matrix by
restarting with a unit matrix and introducing the basic columns to give new eta
vectors. This process of reinversion is common to most product form methods,
and the choice of iteration on which it occurs is usually made on grounds of
efficiency, perhaps using readings from a computer clock.
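
A minimal Python sketch of an eta file may make the product form (8.5.3) more concrete. It is an illustration under assumptions made here and not a description of any particular implementation: each eta vector is held as a pair (pivot row, dictionary of non-zero entries), the names `apply_eta`, `apply_file` and `reinvert` are hypothetical, pivots are chosen by largest modulus among unused rows, and exact rational arithmetic is used so that the result can be checked.

```python
from fractions import Fraction

def apply_eta(eta, v):
    """Return M v, where M is the unit matrix with column p replaced by the
    packed column `col` (a dict mapping row index to non-zero value)."""
    p, col = eta
    vp = v[p]
    w = list(v)
    w[p] = 0
    for i, e in col.items():
        w[i] += e * vp
    return w

def apply_file(etas, v):
    """Apply M_{k-1} ... M_2 M_1 to v; M_1 is applied first, as in (8.5.3)."""
    for eta in etas:
        v = apply_eta(eta, v)
    return v

def reinvert(columns):
    """Build an eta file for the basis matrix with the given columns,
    starting from the unit matrix and introducing one column at a time."""
    m = len(columns)
    etas, free_rows, pivot_row = [], set(range(m)), []
    for a in columns:
        a_hat = apply_file(etas, [Fraction(v) for v in a])
        p = max(sorted(free_rows), key=lambda i: abs(a_hat[i]))
        free_rows.remove(p)
        # Non-trivial column of (8.5.2), written in terms of a_hat = -d.
        col = {i: -a_hat[i] / a_hat[p]
               for i in range(m) if i != p and a_hat[i] != 0}
        col[p] = 1 / a_hat[p]
        etas.append((p, col))
        pivot_row.append(p)
    return etas, pivot_row

# Hypothetical 3 x 3 basis matrix, given by its columns.
columns = [(2, 1, 0), (1, 3, 1), (0, 1, 4)]
etas, pivot_row = reinvert(columns)

# Solve A_B x = b, where b was formed from x = (1, 2, 3).
b = [4, 10, 14]
y = apply_file(etas, [Fraction(v) for v in b])
x = [y[pivot_row[j]] for j in range(len(columns))]
print(x)
```

Only the non-zero entries of each non-trivial column are stored, which is the packed form referred to above; the list `pivot_row` records which row each basic column was pivoted on.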

Product form methods are successful because the number of non-zeros required 
to represent $A_B^{-1}$ as a product can be smaller than the requirement for storing the
usually full matrix $A_B^{-1}$ itself. However there are reasons why (8.5.3), known as the
Gauss-Jordan product form, is not the most efficient. Consider the Gauss-Jordan
method for eliminating variables in a set of linear equations with coefficient
matrix $A$ (ignoring pivoting). This can be written as the reduction of $A$ ($= A^{(1)}$) to
a unit matrix $I$ ($= A^{(n+1)}$) by premultiplication by a sequence of Gauss-Jordan
elementary matrices

$$I = M_n \cdots M_2 M_1 A \tag{8.5.4}$$

where $M_k$ has the form (8.5.2) with $d_j = -a_{jk}^{(k)}$ for all $j$. Another method for solving
linear equations is Gaussian elimination, which is equivalent to the factorization
$A = LU$ (again ignoring pivoting), where $L$ is a unit lower triangular matrix and $U$ is
upper triangular. This can also be written in product form with

$$L = L_1 L_2 \cdots L_{n-1}, \qquad U = U_n U_{n-1} \cdots U_1 \tag{8.5.5}$$

where $L_i$ and $U_i$ are the elementary matrices

$$L_i = \begin{bmatrix}
1 & & & & \\
 & \ddots & & & \\
 & & 1 & & \\
 & & l_{i+1,i} & \ddots & \\
 & & \vdots & & \\
 & & l_{n,i} & & 1
\end{bmatrix} \tag{8.5.6}$$

$$U_i = \begin{bmatrix}
1 & & u_{1i} & & \\
 & \ddots & \vdots & & \\
 & & u_{ii} & & \\
 & & & \ddots & \\
 & & & & 1
\end{bmatrix} \tag{8.5.7}$$

in which the non-trivial column is column $i$ and the element $u_{ii}$ lies in row $i$,
and where $l_{ij}$ and $u_{ij}$ are the elements of the factors $L$ and $U$ of $A$. It has been
observed (see various chapters in Reid (1971), for example, and Question 8.23) that
an LU factorization of a sparse matrix $A$ retains much of the sparsity of $A$ itself;
typically the number of non-zero elements to be stored might be no more than
double the number of non-zeros in $A$, especially when there is some freedom to
do pivoting (as in reinversion).

It is now possible to explain why the Gauss—Jordan product form is not very 
efficient. It can be shown (see Question 8.24) that the non-trivial elements below 
the diagonal in the Gauss-Jordan elementary matrices $M_k$ are the corresponding
elements of $L$ (with opposite sign), whereas the elements on and above the diagonal
are the elements of $U^{-1}$. Since $U^{-1}$ is generally full it follows that only a limited
amount of sparsity is retained in the factors $M_k$. A simple example of this result is
given in Question 8.23. This analysis also shows that use of a Gauss-Jordan product
form does not solve the problem of numerical stability. If $A_B$ is nearly singular then
$U$ can also be nearly singular and $U^{-1}$ is then large. Thus the potential difficulty
due to the presence of large round-off errors is not solved. 

In view of this situation more recent research has concentrated on trying to 
achieve the advantages of both stability and sparsity possessed by an LU decompo- 
sition. The method of Bartels and Golub (Bartels, 1971) is of this type and uses a 
combination of Gaussian elimination and row and column interchanges. Assume 
without loss of generality that the initial basis matrix has triangular factors
$A_B^{(1)} = LU$. Then on the $k$th iteration $A_B^{(k)}$ has the invertible representation defined
by the expression

$$(L_r P_r)(L_{r-1} P_{r-1}) \cdots (L_1 P_1) L^{-1} A_B^{(k)} Q^{(k)} = U^{(k)} \tag{8.5.8}$$

where $U^{(k)}$ is upper triangular, $Q^{(k)}$ is a permutation matrix, and $L_i$ and $P_i$ are
certain elementary lower triangular and permutation matrices for $i = 1, 2, \ldots, r$.
(Note that $r$ refers to $r^{(k)}$, $U^{(1)} = U$, and $Q^{(1)} = I$.) An operation with $A_B^{-1}$ is readily
accomplished since from (8.5.8) it only requires permutations and operations with
triangular matrices.

After a simplex iteration the new matrix $A_B^{(k+1)}$ is defined by (8.2.14), and
since only one column is changed, it is possible to write

$$(L_r P_r)(L_{r-1} P_{r-1}) \cdots (L_1 P_1) L^{-1} A_B^{(k+1)} Q^{(k)} = S \tag{8.5.9}$$

where $S$ differs in only one column ($t$ say) from $U^{(k)}$ and is therefore the spike
matrix illustrated in Figure 8.5.1. This is no longer upper triangular and is not
readily returned to triangular form. However if the columns $t$ to $n$ of $S$ are cyclically
permuted in the order $t \leftarrow t+1 \leftarrow \cdots \leftarrow n \leftarrow t$ then an upper Hessenberg
matrix $H$ is obtained (see Figure 8.5.1). If this permutation is incorporated into
$Q^{(k)}$ giving $Q^{(k+1)}$ then (8.5.9) can be rewritten as

$$(L_r P_r)(L_{r-1} P_{r-1}) \cdots (L_1 P_1) L^{-1} A_B^{(k+1)} Q^{(k+1)} = H. \tag{8.5.10}$$


It is now possible to reduce $H$ to upper triangular form $U^{(k+1)}$ in a stable manner
by a sequence of elementary operations. Firstly a row interchange may be required
to ensure that $|H_{tt}| \geq |H_{t+1,t}|$ and this operation is represented by $P_{r+1}$. Then
the off-diagonal element in column $t$ is eliminated by premultiplying by

$$L_{r+1} = I - l_{t+1,t}\, e_{t+1} e_t^T,$$

that is the unit matrix with the single additional element $-l_{t+1,t}$ in row $t+1$ and
column $t$ ($e_t$ denotes the $t$th coordinate vector),
where $l_{t+1,t} = H_{t+1,t}/H_{tt}$ (or its inverse) and is bounded by 1 in modulus. By
repeating these operations for off-diagonals $t+2, \ldots, n$, a reduction of the form

$$(L_{r^{(k+1)}} P_{r^{(k+1)}}) \cdots (L_{r+1} P_{r+1}) H = U^{(k+1)}$$

is obtained. Incorporating this with (8.5.10) shows that $A_B^{(k+1)}$ has become expressed
in the form (8.5.8).
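
The passage from the spike matrix to a new triangular factor can be sketched as follows. This Python fragment is only an illustration of the column replacement, the cyclic permutation and the stable elimination; it is not the Bartels-Golub code itself. The function name `spike_to_triangular`, the 0-based indexing and the small test matrix are assumptions made here, the earlier factors $L$, $L_i$, $P_i$ are ignored, and the matrix is held as a full array rather than in compact form.

```python
from fractions import Fraction

def spike_to_triangular(U, t, spike):
    """Replace column t of the upper triangular matrix U by `spike`, permute
    columns t..n-1 cyclically so that the spike comes last, and reduce the
    resulting upper Hessenberg matrix to upper triangular form.

    Returns the new triangular matrix and the list of row operations
    (k, swapped, multiplier) that were applied, in order.
    """
    n = len(U)
    S = [[Fraction(v) for v in row] for row in U]
    for i in range(n):
        S[i][t] = Fraction(spike[i])                 # spike matrix S
    # Cyclic column permutation t <- t+1 <- ... <- n-1 <- t.
    H = [row[:t] + row[t + 1:] + [row[t]] for row in S]
    ops = []
    for k in range(t, n - 1):
        # Row interchange so that the pivot is the larger element in modulus.
        swapped = abs(H[k + 1][k]) > abs(H[k][k])
        if swapped:
            H[k], H[k + 1] = H[k + 1], H[k]
        mult = H[k + 1][k] / H[k][k]                 # bounded by 1 in modulus
        H[k + 1] = [x - mult * y for x, y in zip(H[k + 1], H[k])]
        ops.append((k, swapped, mult))
    return H, ops

# Hypothetical 4 x 4 example: column 1 (0-based) is replaced by a full column.
U = [[2, 1, 1, 1],
     [0, 3, 1, 2],
     [0, 0, 4, 1],
     [0, 0, 0, 5]]
U_new, ops = spike_to_triangular(U, 1, [1, 2, 3, 4])
for row in U_new:
    print(row)
```

Each recorded operation corresponds to one pair $(L_i P_i)$ appended to the file of row operations; a zero pivot after interchange would indicate a singular matrix and is not handled in this sketch.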

The Bartels—Golub algorithm was originally proposed as a stable form of the 
simplex method, but because of its close relation to an LU factorization, it is well 
suited to an algorithm which attempts to maintain sparse factors in compact form. 
Reid (1975) describes such an algorithm which builds up a file of the row operations
which premultiply $A_B^{(k)}$ in (8.5.8). The matrix $U^{(k)}$ is also stored in compact
form in random access storage. Reid also describes a further modification in which
additional row and column permutations are made to shorten the length of the 
spike before reducing to Hessenberg form. 

Another idea for stable LP is to update QR factors of $A_B$ (or $A_B^T$), where $Q$ is
orthogonal and $R$ is upper triangular (Gill and Murray, 1973). Since $A_B$ is part of
the problem data and is always available, it is also attractive to exploit the
possibility of not storing $Q$, by using the representation

$$A_B^{-1} = R^{-1} Q^T = R^{-1} R^{-T} A_B^T.$$

Operations with $R^{-1}$ or $R^{-T}$ are carried out by backward or forward substitution.
Also, since $A_B^T A_B = R^T R$, when $A_B$ is updated by the rank one change (8.2.14), a
rank two change in $A_B^T A_B$ is made and hence $R$ can be updated efficiently by using
the methods described by Fletcher and Powell (1975) or Gill and Murray (1978a).
A form of this algorithm suitable for large-scale LP in which the factor is updated
in product form is given by Saunders (1972). Many other algorithms for stable or
sparse LP have been suggested which often incorporate similar ideas of using
triangular or orthogonal factorizations together with permutations to promote a
favourable matrix structure. A number of recent developments are described by
Gill and Murray (1974b). Comparisons between the algorithms are difficult to
make, in particular because they depend critically on the size and structure of the
LP problem itself, but some results are given by Reid (1975).

[Figure 8.5.1 Spike and Hessenberg matrices in the Bartels-Golub method]
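
The representation $A_B^{-1} = R^{-1}R^{-T}A_B^T$, in which $Q$ is never stored, can be illustrated by the following Python sketch. It rests on assumptions that should be stated plainly: the function names are hypothetical, and $R$ is obtained here as the Cholesky factor of $A_B^T A_B$ purely to keep the sketch short, whereas a stable implementation would compute and update $R$ by orthogonal transformations as in the references cited.

```python
import math

def cholesky_upper(G):
    """Return upper triangular R with R^T R = G (G symmetric positive definite)."""
    n = len(G)
    R = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            s = G[i][j] - sum(R[k][i] * R[k][j] for k in range(i))
            R[i][j] = math.sqrt(s) if i == j else s / R[i][i]
    return R

def solve_without_q(A, b):
    """Solve A x = b using x = R^{-1} R^{-T} A^T b."""
    n = len(A)
    G = [[sum(A[k][i] * A[k][j] for k in range(n)) for j in range(n)]
         for i in range(n)]                              # A^T A = R^T R
    R = cholesky_upper(G)
    g = [sum(A[k][i] * b[k] for k in range(n)) for i in range(n)]   # A^T b
    y = [0.0] * n
    for i in range(n):                                   # forward: R^T y = g
        y[i] = (g[i] - sum(R[k][i] * y[k] for k in range(i))) / R[i][i]
    x = [0.0] * n
    for i in reversed(range(n)):                         # backward: R x = y
        x[i] = (y[i] - sum(R[i][k] * x[k] for k in range(i + 1, n))) / R[i][i]
    return x

# Hypothetical basis matrix (given by rows) and right-hand side with solution (1, 2, 3).
A_B = [[2.0, 1.0, 0.0],
       [1.0, 3.0, 1.0],
       [0.0, 1.0, 4.0]]
b = [4.0, 10.0, 14.0]
x = solve_without_q(A_B, b)
print(x)
```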

A different approach to sparsity is to attempt to find a partial decomposition 
of the problem into smaller subproblems, each of which can be solved repeatedly 
under the control of some master problem. The decomposition method of Dantzig 
and Wolfe (1960) illustrates the principles well, and is described here although some 
other possibilities are described by Gill and Murray (1974b). The idea is of limited 
applicability since it depends on the main problem having a favourable structure, 
and even then it is not clear that the resulting method is better than using a general 
sparse matrix LP package. A well-known situation which might be expected to 
favour decomposition is when the LP problem models a system composed of a 
number of quasi-independent subsystems. Each subsystem has its own equations 
and variables, but there are a few additional equations which interrelate all the 
variables and may represent, for example, the distribution of shared resources
throughout the system. Let each subsystem (for $i = 1, 2, \ldots, r$) be expressed in
terms of $n_i$ variables $x_i$ and be subject to the conditions

$$C_i x_i = b_i, \qquad x_i \geq 0, \tag{8.5.11}$$

where $C_i$ and $b_i$ have $m_i$ rows. Also let there be $m_0$ additional conditions

$$\sum_{i=1}^{r} B_i x_i = b_0 \tag{8.5.12}$$

so that $b_0$ and the matrices $B_i$ have $m_0$ rows. Then the overall problem can be
expressed in terms of an LP problem of the form (8.1.1) with

$$n = \sum_{i=1}^{r} n_i, \qquad m = \sum_{i=0}^{r} m_i, \qquad
x^T = (x_1^T, x_2^T, \ldots, x_r^T), \qquad b^T = (b_0^T, b_1^T, \ldots, b_r^T),$$

and the coefficient matrix has the structure

$$A = \begin{bmatrix}
B_1 & B_2 & \cdots & B_r \\
C_1 & & & \\
 & C_2 & & \\
 & & \ddots & \\
 & & & C_r
\end{bmatrix}. \tag{8.5.13}$$



The Dantzig—Wolfe decomposition method is applicable to this problem and 
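uses the following idea. Assume for each $i$ that (8.5.11) has $p_i$ extreme points which
are columns of the matrix $E_i$, and that any feasible $x_i$ can be expanded as a convex
combination

$$x_i = E_i \theta_i, \qquad e^T \theta_i = 1, \qquad \theta_i \geq 0 \tag{8.5.14}$$

of these extreme points (see Section 9.4), where $e$ denotes a vector of ones. This
assumption is valid if the feasible region of (8.5.11) is bounded for each $i$, although
in fact the assumption of boundedness is unnecessary. Then the main problem can
be transformed to the equivalent LP problem

$$\begin{aligned}
&\underset{\theta}{\text{minimize}} && f^T \theta \\
&\text{subject to} && G\theta = h, \quad \theta \geq 0
\end{aligned} \tag{8.5.15}$$

where

$$\theta = \begin{pmatrix} \theta_1 \\ \theta_2 \\ \vdots \\ \theta_r \end{pmatrix}, \qquad
f = \begin{pmatrix} f_1 \\ f_2 \\ \vdots \\ f_r \end{pmatrix}, \qquad
h = \begin{pmatrix} b_0 \\ 1 \\ \vdots \\ 1 \end{pmatrix}, \qquad
G = \begin{bmatrix}
G_1 & G_2 & \cdots & G_r \\
e^T & & & \\
 & e^T & & \\
 & & \ddots & \\
 & & & e^T
\end{bmatrix}$$

and where $G_i = B_i E_i$ and $f_i^T = c_i^T E_i$. This problem has many fewer equations
($m_0 + r$) but very many more variables ($\sum_i p_i$). However it turns out that it is possible
to solve (8.5.15) by the revised simplex method without explicitly evaluating
all the matrices $E_i$.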

Consider therefore an iteration of the revised simplex method for (8.5.15). There
is no difficulty in having available the basis matrix $G_B$, which has $m_0 + r$ columns
(one from each subsystem). Each column is calculated from one extreme point. As
usual this matrix determines the values $\hat h = G_B^{-1}h$ of the basic variables $\theta_B$, and
$\hat h \ge 0$. However the reduced costs of the nonbasic variables defined by

$$\hat f_N = f_N - G_N^T\pi,$$

where $\pi = G_B^{-T}f_B$, are not available since this would require all the remaining
extreme points to be calculated. Nonetheless the smallest reduced cost can be
calculated indirectly. Let $j \in N$ index a variable $\theta_j$ in the subvector $\theta_i$, and let $\pi$ be
partitioned into $\pi^T = (u^T, v^T)$ where $v$ has $r$ elements. Then by definition of $f_i$
and $G_i$, the corresponding reduced cost $\hat f_j$ can be written

$$\hat f_j = f_j - g_j^Tu - v_i = (c_i^T - u^TB_i)\xi_j - v_i \tag{8.5.16}$$

where $g_j$ is the corresponding column of $G_i$ and $\xi_j$ is the corresponding extreme
point of (8.5.11). Consider finding the smallest $\hat f_j$ for all indices $j$ in the subset $i$;
this is equivalent to finding the extreme point of (8.5.11) which minimizes (8.5.16).
Since the solution of an LP problem occurs at an extreme point, this is equivalent
to solving the LP

$$\begin{aligned}&\underset{x_i}{\text{minimize}} && (c_i - B_i^Tu)^Tx_i - v_i\\ &\text{subject to} && C_ix_i = b_i, \quad x_i \ge 0.\end{aligned} \tag{8.5.17}$$


Thus the smallest reduced cost for all $j$ is found by solving (8.5.17) for all
$i = 1, 2, \dots, r$ so as to give the extreme point ($\xi_q$ say) which for all $i$ gives the
smallest cost function value in (8.5.17). Thus the smallest reduced cost $\hat f_q$ is
determined and the simplex method can proceed by increasing the variable $\theta_q$.
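
The pricing step just described can be sketched as follows. This is only an illustration under stated assumptions: the two small subsystems, the multipliers `u` and `v`, and the names `price` and `subsystems` are all hypothetical, and each subproblem is so small that its extreme points are simply listed. In a real calculation the LP (8.5.17) would be solved instead of enumerating extreme points.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def price(subsystems, u, v):
    """Return (reduced cost, subsystem index, extreme point) giving the
    smallest value of (c_i - B_i^T u)^T xi - v_i, as in (8.5.16)."""
    best = None
    for i, (c, B, extreme_points) in enumerate(subsystems):
        # Modified cost vector c_i - B_i^T u for subsystem i.
        cost = [c[k] - sum(B[row][k] * u[row] for row in range(len(u)))
                for k in range(len(c))]
        for xi in extreme_points:      # stand-in for solving the LP (8.5.17)
            value = dot(cost, xi) - v[i]
            if best is None or value < best[0]:
                best = (value, i, xi)
    return best

# Hypothetical data: one linking constraint (m0 = 1) and r = 2 subsystems,
# each with feasible region {x >= 0, x_1 + x_2 = 1}.
subsystems = [
    ([1, 2], [[1, 1]], [(1, 0), (0, 1)]),   # (c_0, B_0, extreme points)
    ([3, 2], [[2, 1]], [(1, 0), (0, 1)]),   # (c_1, B_1, extreme points)
]
u, v = [2], [-1, 0]

best = price(subsystems, u, v)
print(best)
```

With these numbers the most negative reduced cost comes from the second subsystem, so a column built from its extreme point $(1, 0)^T$ would enter the basis.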

If the method does not terminate, the next step is to find the basic variable $\theta_p$
which first becomes zero. Column $q$ of $G$ is readily computed using $g_q = B_i\xi_q$
where $i$ now refers to the subsystem which includes the variable $q$. Thus $d$ in
(8.2.10) can be calculated so that the solution of (8.2.11) determines $\theta_p$ in the
usual way. The indices $p$ and $q$ are interchanged, and the same column of $G$ enables
the matrix $G_B^{-1}$ to be updated using (8.2.15), which completes the description of the
algorithm.

One possible situation which has not been considered is that the solution to 
(8.5.17) may be unbounded. This can be handled with a small addition to the 
algorithm as described by Hadley (1962). Another feature of note is that the sub- 
problem constraints (8.5.11) can be any convex polyhedron and the same develop- 
ment applies. A numerical example of the method in which this feature occurs is 
given by Hadley (1962). 

As a final remark on large-scale LP, it is observed that the idea which occurs in 
the Dantzig—Wolfe method of solving an LP without explicitly generating all the 
columns in the coefficient matrix, is in fact of much wider applicability. It is poss- 
ible to model some applications by LP problems in which the number of variables 
is extremely large, and to utilize some other feature to find which reduced cost is 
least. An interesting example of this is the ship scheduling problem; see Appelgren 
(1971), for example. 


8.6 Degeneracy 


It is observed in Section 8.2 that the simplex method converges when each iteration
is known to reduce the objective function. Failure to terminate can arise when an
iteration occurs on which $\hat b_p = 0$ and $d_p < 0$. Then no change to the variable $x_q$ is
permitted, and hence no decrease in $f$. Although a new partitioning $B$ and $N$ can be
made, it is possible for the algorithm to return to a previous set $B$ and $N$ and hence
to cycle infinitely without locating the solution. The situation in which $\hat b_i = 0$
occurs is known as degeneracy. The same can happen in the active set method when
$a_i^Tx^{(k)} = b_i$ for some indices $i \notin \mathcal{A}$. If $a_i^Ts_q < 0$ then these constraints allow no
progress along the search direction $s_q$. Although changes in the active set can be
made, the possibility of cycling again arises. This is to be expected since the
algorithms are equivalent as described in Question 8.17, and so the examples of
degeneracy, etc., described below apply equally to both the simplex and active set methods.
Degeneracy can arise in other active set methods (Sections 10.3 and 11.2) and the
remarks in this section also apply there.

An example in which cycling occurs has been given by Beale (see Hadley, 1962) 
and is described in Question 8.25. For cycling to occur, a number of elements
$\hat b_i = 0$ must occur, and the way in which ties are broken in (8.2.11) is crucial.
Beale’s example uses the plausible rule that p is chosen as the first such p which 
occurs in the tableau. (This is not the same as the first in numerical order but is a 
well-defined rule.) In practice, however, degeneracy and cycling are rarely observed 
and it has become fashionable to assert that the problem can be neglected, or can 
be solved by making small perturbations to the data. On the other hand, I doubt 
that many LP codes monitor whether degeneracy or near degeneracy arises, so it 
may be that the problem arises more frequently and goes undetected. Graves and 
Brown (1979) indicate that on some large sparse LP systems, between 50 and 90 
per cent of iterations are degenerate. I have also come across degeneracy on two or 
three occasions in quadratic programming problems. 
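
The cycling behaviour can be reproduced with a short exact-arithmetic sketch. It assumes the data of Beale's example as it is usually stated (see Question 8.25), chooses the entering variable by the most negative reduced cost, and breaks ties in the ratio test by taking the first eligible row of the tableau. The function names are illustrative only.

```python
from fractions import Fraction as F

def pivot(T, r, c):
    """Gauss-Jordan pivot of the tableau T (a list of rows) on element (r, c)."""
    T[r] = [a / T[r][c] for a in T[r]]
    for i in range(len(T)):
        if i != r:
            factor = T[i][c]
            T[i] = [a - factor * b for a, b in zip(T[i], T[r])]

def beale_tableau():
    # Columns x1..x7 followed by the right hand side; the last row holds
    # the reduced costs. Slack variables x5, x6, x7 form the initial basis.
    rows = [
        ["1/4", -8, -1, 9, 1, 0, 0, 0],
        ["1/2", -12, "-1/2", 3, 0, 1, 0, 0],
        [0, 0, 1, 0, 0, 0, 1, 1],
        ["-3/4", 20, "-1/2", 6, 0, 0, 0, 0],
    ]
    return [[F(a) for a in row] for row in rows]

def iterate(T, basis):
    """One simplex iteration; returns False if the tableau is optimal."""
    costs = T[-1][:-1]
    q = costs.index(min(costs))            # most negative reduced cost
    if costs[q] >= 0:
        return False
    best_row, best_ratio = None, None
    for i in range(len(T) - 1):
        if T[i][q] > 0:
            ratio = T[i][-1] / T[i][q]
            # Strict inequality: the FIRST row attaining the minimum is kept.
            if best_ratio is None or ratio < best_ratio:
                best_row, best_ratio = i, ratio
    if best_row is None:
        raise ValueError("problem is unbounded")
    pivot(T, best_row, q)
    basis[best_row] = q + 1
    return True

T, basis = beale_tableau(), [5, 6, 7]
history = [list(basis)]
for _ in range(6):
    iterate(T, basis)
    history.append(list(basis))

for step, b in enumerate(history):
    print(step, b)
print("tableau restored:", T == beale_tableau())
```

Every one of the six pivots is degenerate, so the objective function never changes, and the final tableau coincides with the initial one.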

It is possible to state a rule (Charnes, 1952) for resolving ties in (8.2.11) which 
does not cause cycling, and hence which in theory resolves the degeneracy problem. 
(In practice this may not be the case because of round-off error difficulties, a
subject which is considered further at the end of this section.) Consider perturbing the
right hand side vector $b$ to give $b(\epsilon)$ where

$$b(\epsilon) = b + \sum_{j=1}^{n}\epsilon^ja_j. \tag{8.6.1}$$

Multiplying through by $A_B^{-1}$ and denoting $\hat A = A_B^{-1}A$ gives

$$\hat b(\epsilon) = \hat b + \sum_{j=1}^{n}\epsilon^j\hat a_j \tag{8.6.2}$$

where $\hat a_j$ is the $j$th column of $\hat A$. This perturbation breaks up the degenerate
situation, giving instead a closely grouped set of non-degenerate extreme points. The tie
break rule (which does not require $\epsilon$ to be known) comes about by finding how the
simplex method solves this perturbed problem for sufficiently small $\epsilon$. Each element
in (8.6.1) or (8.6.2) is a polynomial of degree $n$ in $\epsilon$. The following notation is
useful. If


$$u(\epsilon) = u_0 + u_1\epsilon + \cdots + u_n\epsilon^n$$

is any such polynomial, $v(\epsilon)$ similarly, then $u(\epsilon) > v(\epsilon)$ for all sufficiently small
$\epsilon > 0$ if and only if

$$\begin{aligned}
&u_0 > v_0\\
\text{or}\quad &u_0 = v_0 \text{ and } u_1 > v_1\\
\text{or}\quad &u_0 = v_0,\ u_1 = v_1 \text{ and } u_2 > v_2\\
&\qquad\vdots\\
\text{or}\quad &u_0 = v_0,\ u_1 = v_1,\ \dots,\ u_{n-1} = v_{n-1} \text{ and } u_n > v_n,
\end{aligned} \tag{8.6.3}$$

and $u(\epsilon) = v(\epsilon)$ iff $u_i = v_i\ \forall\, i = 0, 1, \dots, n$. Dantzig, Orden, and Wolfe (1955)
let (8.6.3) define a lexicographic ordering $u \succ v$ of the vectors $u$ and $v$ whose
elements are the coefficients of the polynomial and show that the perturbation
method can be interpreted in this way. I shall use the notation $u(\epsilon) \succ v(\epsilon)$ to mean
that $u(\epsilon) > v(\epsilon)$ for all sufficiently small $\epsilon > 0$.
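
The ordering (8.6.3) amounts to comparing coefficient vectors from the lowest power upwards. The following minimal sketch (the names `lex_greater` and `evaluate` and the sample coefficients are illustrative assumptions) shows the comparison, and also shows why the phrase "sufficiently small" matters.

```python
def lex_greater(u, v):
    """True if u(eps) > v(eps) for all sufficiently small eps > 0, using (8.6.3);
    u and v are coefficient lists [u0, u1, ..., un] of equal length."""
    for ui, vi in zip(u, v):
        if ui != vi:
            return ui > vi      # the first differing coefficient decides
    return False                # identical polynomials

def evaluate(coeffs, eps):
    """Value of the polynomial with the given coefficients at eps."""
    return sum(c * eps**k for k, c in enumerate(coeffs))

u = [0, 1, -100]    # u(eps) = eps - 100 eps^2
v = [0, 0, 50]      # v(eps) = 50 eps^2

print(lex_greater(u, v))                        # ordering by coefficients
print(evaluate(u, 1e-3) > evaluate(v, 1e-3))    # agrees for small eps
print(evaluate(u, 1.0) > evaluate(v, 1.0))      # but not for eps = 1
```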

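Assume for the moment that the initial vertex satisfies $\hat b^{(1)}(\epsilon) \succ 0$ (that is
$\hat b_i^{(1)}(\epsilon) \succ 0\ \forall\, i \in B^{(1)}$). The rule for updating the tableau is

$$\begin{aligned}
\hat b_p^{(k+1)} &= \frac{\hat b_p^{(k)}}{\hat a_{pq}}\\
\hat b_i^{(k+1)} &= \hat b_i^{(k)} - \frac{\hat a_{iq}}{\hat a_{pq}}\hat b_p^{(k)}, \qquad i \ne p.
\end{aligned} \tag{8.6.4}$$

Also $\hat a_q = -d$ by definition, so $\hat a_{pq} = -d_p > 0$ from (8.2.11). Thus if $\hat b^{(k)}(\epsilon) \succ 0$,
then both $\hat b_p^{(k+1)}(\epsilon) \succ 0$ and $\hat b_i^{(k+1)}(\epsilon) \succ 0$ for all $i : d_i \ge 0$, $i \ne p$. Now let (8.2.11)
be solved in the lexicographic sense

$$\frac{\hat b_p(\epsilon)}{-d_p} = \min_{\substack{i \in B\\ d_i < 0}} \frac{\hat b_i(\epsilon)}{-d_i}. \tag{8.6.5}$$

It follows using $\hat a_{iq} = -d_i$ that

$$\hat b_i(\epsilon) \succ \frac{\hat a_{iq}}{\hat a_{pq}}\hat b_p(\epsilon), \qquad i \ne p.$$

(Equality is excluded because this would imply two equal rows in the tableau,
which contradicts the independence assumption on $A$.) Hence from (8.6.4),
$\hat b_i^{(k+1)}(\epsilon) \succ 0$ for $i : d_i < 0$. Thus $\hat b^{(k)}(\epsilon) \succ 0$ implies $\hat b^{(k+1)}(\epsilon) \succ 0$, and so
inductively the result holds for all $k$. Thus the points obtained when the simplex method
is applied to the perturbed problem for sufficiently small $\epsilon$ are both feasible and
non-degenerate. It therefore follows that $f^{(k)}(\epsilon) \succ f^{(k+1)}(\epsilon)$, and so the algorithm
must terminate at the solution. There is no difficulty in ensuring that $\hat b^{(1)}(\epsilon) \succ 0$.
If the initial b.f.s. is non-degenerate then this condition is implied by $\hat b^{(1)} > 0$. If
not, then there always exists a unit matrix in the tableau, and the columns of the
tableau should be rearranged so that these columns occur first, which again implies
that $\hat b^{(1)}(\epsilon) \succ 0$.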

The required tie break rule in (8.2.11) is therefore the one obtained by solving 
(8.6.5) in the lexicographic sense of (8.6.3). Thus in finding $\min \hat b_i(\epsilon)/(-d_i)$
$\forall\, i \in B$, $d_i < 0$, the zero order terms $\hat b_i/(-d_i)$ are compared first. If there are any
ties then the first order terms $\hat a_{i1}/(-d_i)$ are used to break the ties. If any ties still
remain, then the second order terms $\hat a_{i2}/(-d_i)$ are used, and so on. The rule clearly
depends on the column ordering and any initial ordering for which $\hat b^{(1)}(\epsilon) \succ 0$ may
be used (see Question 8.26). However once chosen this ordering must remain fixed,
and the rearrangement of the tableau into B and N columns in (8.2.13) must be 
avoided (or accounted for). The method assumes that the columns stay in a fixed 
order and that the columns of I may occur anywhere in the tableau. A similar 
method (Wolfe, 1963b) is to introduce the virtual perturbations only as they are 
required, and is particularly suitable for practical use. 
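
The tie break rule can be sketched in a few lines of exact arithmetic. The example again assumes the usual statement of Beale's problem (Question 8.25) and the fixed column ordering $\{5, 6, 7, 1, 2, 3, 4\}$; the function names are illustrative only. Each candidate row is represented by the coefficient vector of $\hat b_i(\epsilon)/(-d_i)$, and the lexicographically smallest vector determines the leaving variable.

```python
from fractions import Fraction as F

ORDER = [5, 6, 7, 1, 2, 3, 4]      # fixed column ordering, unit columns first

def pivot(T, r, c):
    """Gauss-Jordan pivot of the tableau T on element (r, c)."""
    T[r] = [a / T[r][c] for a in T[r]]
    for i in range(len(T)):
        if i != r:
            factor = T[i][c]
            T[i] = [a - factor * b for a, b in zip(T[i], T[r])]

def leaving_row(T, q):
    """Row p solving (8.6.5) in the lexicographic sense of (8.6.3)."""
    rows = []
    for i in range(len(T) - 1):            # the last row holds reduced costs
        if T[i][q] > 0:                    # that is, d_i < 0
            # Zero order term first, then tableau entries in the fixed ORDER.
            key = [T[i][-1] / T[i][q]] + [T[i][j - 1] / T[i][q] for j in ORDER]
            rows.append((key, i))
    return min(rows)[1]                    # lists compare lexicographically

# Columns x1..x7 then right hand side; last row holds the reduced costs.
T = [[F(a) for a in row] for row in [
    ["1/4", -8, -1, 9, 1, 0, 0, 0],
    ["1/2", -12, "-1/2", 3, 0, 1, 0, 0],
    [0, 0, 1, 0, 0, 0, 1, 1],
    ["-3/4", 20, "-1/2", 6, 0, 0, 0, 0],
]]
basis, leaving = [5, 6, 7], []
while min(T[-1][:-1]) < 0:
    costs = T[-1][:-1]
    q = costs.index(min(costs))            # most negative reduced cost
    p = leaving_row(T, q)
    leaving.append(basis[p])
    pivot(T, p, q)
    basis[p] = q + 1

optimum = -T[-1][-1]                       # the cost row stores minus f
print(leaving, basis, optimum)
```

In this sketch the first tie is resolved in favour of variable 6 rather than variable 5, and the method terminates after two iterations instead of cycling.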

An equivalent rule for breaking ties in the active set method can be determined 
by virtue of Question 8.17. The equivalent tableau is given by

$$\begin{pmatrix} I & 0 & -A_1^{-T}\\ 0 & I & -A_2^TA_1^{-T}\end{pmatrix} \tag{8.6.6}$$

in which the three column blocks correspond to $x$, $z_2$, and $z_1$ respectively,
and if degeneracy occurs in some inactive slacks $z_i$ then ties are resolved by
examining the corresponding $i$th rows of $[I : -A_2^TA_1^{-T}]$ in accordance with some fixed
ordering of the slack variables.

An alternative way of resolving degeneracy exists which avoids iterating the full
LP a number of times as in the perturbation method. Instead a smaller non-degenerate
LP is solved to determine a search direction $s$ in the nonbasic variables which
moves away from the degenerate vertex. Let $\hat A_D$ denote the partition of the matrix
$\hat A_N$ whose rows correspond to degenerate basic variables. Consider changing each
$x_i$, $i \in N$, by an amount $\alpha s_i$, so that $s$ is the required search direction. Since $s$ is
required to reduce the objective function, it is normalized to have negative slope by
$\hat c_N^Ts = -1$. Feasibility of $x_N$ requires that $s \ge 0$, and feasibility of the degenerate
variables requires that $\hat A_Ds \le 0$. Any vector $s$ which is feasible in the above
conditions can be chosen, so the freedom is taken up by minimizing $\|s\|_1$. Thus $s$
solves the problem

$$\begin{aligned}&\underset{s}{\text{minimize}} && e^Ts\\ &\text{subject to} && -\hat c_N^Ts = 1\\ & && -\hat A_Ds \ge 0, \quad s \ge 0,\end{aligned} \tag{8.6.7}$$

where $e$ is a vector of ones. If no solution exists then the degenerate vertex is
optimal. This problem is also degenerate (at $s = 0$), but its dual problem

$$\begin{aligned}&\underset{\lambda_0,\,\lambda}{\text{maximize}} && \lambda_0\\ &\text{subject to} && -\hat c_N\lambda_0 - \hat A_D^T\lambda \le e, \quad \lambda \ge 0\end{aligned} \tag{8.6.8}$$

(see Section 9.5) is generally not. This problem has only $d + 1$ variables where $d$ is
the number of degenerate variables, and can usually be solved readily by the active
set method (see Question 8.27).
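
A candidate pair of solutions to (8.6.7) and (8.6.8) can be checked mechanically, since feasibility in both problems together with equal objective values implies optimality. The sketch below does this for the degenerate initial vertex of Beale's example (Question 8.25), assuming the usual statement of that problem. The reduced costs and the two degenerate tableau rows are read from the initial tableau, and the candidate values `s`, `lam0`, and `lam` are assumptions that the code verifies, not values quoted from elsewhere.

```python
from fractions import Fraction as F

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Reduced costs and degenerate rows at the initial vertex (nonbasic x1..x4).
c_N = [F(-3, 4), F(20), F(-1, 2), F(6)]
A_D = [[F(1, 4), F(-8), F(-1), F(9)],
       [F(1, 2), F(-12), F(-1, 2), F(3)]]

s = [F(4, 5), F(0), F(4, 5), F(0)]        # candidate for (8.6.7)
lam0, lam = F(8, 5), [F(0), F(2, 5)]      # candidate for (8.6.8)

# Primal feasibility: c_N^T s = -1, A_D s <= 0, s >= 0.
primal_feasible = (dot(c_N, s) == -1
                   and all(dot(row, s) <= 0 for row in A_D)
                   and all(si >= 0 for si in s))

# Dual feasibility: e + c_N lam0 + A_D^T lam >= 0, lam >= 0.
dual_slack = [1 + c_N[j] * lam0 + sum(A_D[i][j] * lam[i] for i in range(2))
              for j in range(4)]
dual_feasible = all(t >= 0 for t in dual_slack) and all(l >= 0 for l in lam)

print(primal_feasible, dual_feasible, sum(s) == lam0)
```

Because the objective values $e^Ts$ and $\lambda_0$ agree, the direction $s$ moves away from the degenerate vertex along a descent direction that keeps the degenerate variables feasible.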

A similar approach can be taken to resolving degeneracy in the active set method.
In this case a feasible direction $s$ requires that $a_i^Ts \ge 0$ for all
$i \in \mathcal{A}^{(k)} = \{i : a_i^Tx^{(k)} = b_i\}$. A downhill direction is obtained by normalizing $c^Ts = -1$
($c = \nabla f$; see (8.3.1)) and the remaining freedom is taken up by minimizing $\|s\|_1$,
giving the problem

$$\begin{aligned}&\underset{s}{\text{minimize}} && \|s\|_1\\ &\text{subject to} && c^Ts = -1\\ & && a_i^Ts \ge 0, \quad i \in \mathcal{A}^{(k)}.\end{aligned} \tag{8.6.9}$$

Again if no solution exists then $x^{(k)}$ is optimal. Also (8.6.9) is degenerate, but its
dual problem

$$\begin{aligned}&\underset{v,\,\lambda_0,\,\lambda}{\text{minimize}} && \lambda_0\\ &\text{subject to} && v = c\lambda_0 + A\lambda\\ & && \|v\|_\infty \le 1, \quad \lambda \ge 0\end{aligned} \tag{8.6.10}$$


(Watson, 1978) is generally not, and can be solved. It may be that solving problems 
like (8.6.8) or (8.6.10) offers a more rapid way of resolving a degenerate vertex and 
so this type of possibility should be kept in mind as an alternative to the pertur- 
bation method. However it is not easy to incorporate these subproblems into an LP 
package, especially if degeneracy in these subproblems is allowed for (introducing a 
recursive aspect), which presumably can occur in theory. 

In practice, however, cycling is not the only difficulty caused by degeneracy,
and may not even be the most important. I have also observed difficulties due to
round-off errors which can cause an algorithm to break down. The situation is most
easily explained in terms of the active set method and is illustrated in Figure 8.6.1.
There are three constraints in $\mathbb{R}^3$ which have a common line and are therefore
degenerate. Initially constraint 1 is in $\mathcal{A}$, and the search along $s^{(1)}$ approaches the
degenerate vertex $x^{(2)}$. The tie concerning which constraint to include in $\mathcal{A}$ is
resolved in favour of constraint 2. The next search is along $s^{(2)}$ with constraints 1
and 2 active; all appears normal and cycling does not occur at $x^{(2)}$. What can
happen, however, is that due to round-off error, the slack in constraint 3 is
calculated as $\sim\epsilon$ (in place of zero) and the component of $-a_3$ along $s^{(2)}$ is also $\sim\epsilon$ in
place of zero. Thus constraint 3 can become active at a spurious vertex $x^{(3)}$ with a
value of $\alpha \sim 1$ in (8.3.6). The algorithm thus breaks down because $x^{(3)}$ is not a true
vertex, and the matrix of active constraints $A$ in (8.3.3) is singular. An equivalent
situation arises in the simplex method when true values of $\hat b_i$ and $-d_i$ are zero, but
are perturbed to $\sim\epsilon$ by round-off error. Then the solution of (8.2.11) can give a
spurious b.f.s. in which the matrix $A_B$ is singular. I have observed this type of
behaviour on two or three occasions in practice when applying a quadratic
programming package, and the only remedy has been to remove degenerate constraints from
the problem. It may not be easy for the user to do this (or even to recognize that it
is necessary) although it is certainly important for him or her to be aware of the
possibility of failure on this account. Clearly further research is required to
determine a suitable modification to linear constraint algorithms which is simple and
yet effective in controlling all the ill effects of a degenerate problem.

Figure 8.6.1 A practical difficulty caused by degeneracy


Questions for Chapter 8 
1. Consider the linear program

minimize    $-4x_1 - 2x_2 - 2x_3$
subject to  $3x_1 + x_2 + x_3 = 12$
            $x_1 - x_2 + x_3 = -8, \quad x \ge 0.$

Verify that the choice $B = \{1, 2\}$, $N = \{3\}$ gives a basic feasible solution, and
hence solve the problem.

2. Consider the LP problem of Question 1 and replace the objective function by
the function $-4x_1 - 2x_3$. Show that the basic feasible solutions derived from
$B = \{1, 2\}$, $N = \{3\}$ and $B = \{2, 3\}$, $N = \{1\}$ are both optimal, and that any
convex combination of these solutions is also a solution to the problem.

In general describe how the fact that an LP problem has a non-unique sol- 
ution can be recognized, and explain why. 

3. Sketch the set of feasible solutions of the following inequalities: 


cee 0) ate 0 
x4 + 2x7 <4 
—X;, + esl 
Ket Ks = St 


At which points of this set does the function x, — 2x2 take (a) its maximum 
and (b) its minimum value? 
4. Consider the problem 
maximize x 
subject to 2x, +3x2 <9 
|x; -2|S<1, x20. 


(i) Solve the problem graphically. 
(ii) Formulate the problem as a standard LP problem. 
5. Use the tableau form of the simplex method to solve the LP problem


MMIMIZe OX, = OX — 3X3 

subject to. 2x4 oX>.— Xa. 
3x; — 8x, +2x3 <4 
IN7 — 1X5 t 3X35 259, x20 


after reducing the problem to standard form. 

6. A manufacturer uses resources of material ($m$) and labour ($\ell$) to make up to
four possible items (a to d). The requirements for these, and the resulting
profits are given by

item a requires $4m + 2\ell$ resources and yields £5/item profit,
item b requires $m + 5\ell$ resources and yields £8/item profit,
item c requires $2m + \ell$ resources and yields £3/item profit,
item d requires $2m + 3\ell$ resources and yields £4/item profit.


There are available up to 30 units of material and 50 units of labour per day. 
Assuming that these resources are fully used and neglecting integrality con- 
straints, show that the manufacturing schedule for maximum profit is an LP 
problem in standard form. Show that the policy of manufacturing only the 
two highest profit items yields a b.f.s. which is not optimal. Find the optimal 
schedule. Evaluate also the schedule in which equal amounts of each item are 
manufactured and show there is an under-use of one resource. Compare the 
profit in each of these cases. 
7. Use the simplex method to show that for all $\mu$ in the range $-3 < \mu < -\tfrac12$, the
solution of the problem

minimize    $\mu x_1 - x_2$
subject to  $x_1 + 2x_2 \le 4$
            $6x_1 + 2x_2 \le 9, \quad x \ge 0$

occurs at the same point $x^*$ and find $x_1^*$ and $x_2^*$.
8. Consider modifying the revised simplex method to solve a more general standard
form of LP,

minimize    $c^Tx$
subject to  $Ax = b, \quad l \le x \le u.$

Show that the following changes to the simplex method are required. Nonbasic
variables take the value $x_i^{(k)} = l_i$ or $u_i$, and $\hat b = A_B^{-1}(b - A_Nx_N^{(k)})$ is used
to calculate $\hat b$. The optimality test (8.2.8) is changed to require that the set

$$\{i : i \in N,\ \hat c_i < 0 \text{ if } x_i^{(k)} = l_i,\ \hat c_i > 0 \text{ if } x_i^{(k)} = u_i\}$$

is empty; if not $\hat c_q$ is chosen to min $|\hat c_i|$ for $i$ in this set. The choice of $p$ in
(8.2.11) is determined either when a variable in $B$ reaches its upper or lower
bound or when the variable $x_q$ reaches its other bound. In the latter case no
changes to $B$ or $N$ need to be made. Write down the resulting formula for the
choice of $p$.


9. Using the method of Question 8, verify that $B = \{1, 2\}$, $N = \{3, 4\}$, $x_N = 0$
gives a b.f.s. for the LP problem


minimize Xx, rae 
subject to X,,+X7— X34 + 2x4 =6 
Pe Tene) 2X4 =2 


$0 \le x_1 \le 4, \quad 0 \le x_2 \le 10, \quad 0 \le x_3 \le 4, \quad 0 \le x_4 \le 2$


and hence solve the problem. 
10. By drawing a diagram of the feasible region, solve the problems


minimize xX 

subject tox, tx, = 1 
XP X= 2 
x4 20 


(a) when $x_2 \ge 0$ also applies and (b) when $x_2$ is unrestricted. Verify that the
solutions are correct by posing each problem in standard form, determining
the basic variables from the diagram, and then checking feasibility and
optimality of the b.f.s. In case (b) introduce variables $x_2^+$ and $x_2^-$ as described
in Section 7.2.

11. A linear data fitting problem can be stated as finding the best solution for $x$ of
the system of equations $Ax = b$ in which $A$ has more rows than columns
(overdetermined). In some circumstances it can be valuable to find that solution
which minimizes $\sum_i |r_i|$ where $r = Ax - b$, rather than the more usual $\sum_i r_i^2$
(least squares). Using the idea of replacing $r_i$ by two variables $r_i^+$ and $r_i^-$, one
of which will always be zero, show that the problem can be stated as an LP
(not in standard form).

12. Consider the problem


minimize |x |+|y|+|z| 
subject to. x 2) = 3 
SP term | 
X,Y, Z unrestricted in sign. 


Define suitable non-negative variables $x^+, x^-, y^+, y^-$, etc., and write down an
LP problem in standard form. Verify that the variables x* andz’ provide an 
initial basic feasible solution, and hence solve the LP problem. What are the 
optimum function value and variables in the original problem? Is the solution 
unique? 

13. Consider the problem in Question 12. Without adding extra variables other
than a slack variable, show that the problem can be solved (as in (8.4.5)) by
allowing the coefficients in the cost function to change between $\pm 1$ as the
iteration proceeds.

14. Given an optimal b.f.s. to a problem in standard form, consider perturbing the
problem so that

(i)   $b(\lambda) = b + \lambda r$
(ii)  $c_N(\lambda) = c_N + \lambda s$
(iii) $c_B(\lambda) = c_B + \lambda t$

where in each case $\lambda$ is increased from zero. At what value of $\lambda$, if any (in each
case), will the present solution no longer be an optimal b.f.s.? Consider the
solution of Question 6 and the effect of (i) a reduction in the availability of 
labour, (ii) an increase in the profit of item a, and (iii) a decrease in the profit 
of item c. At what stage (in each case) do these changes cause a change in the 
type of item to be manufactured in the optimal schedule? 

15. Use the active set method to solve the LP problem


maximize xX, + Xp 
subject to3x, + x, 23 
x, + 4x, 24, x20. 


Illustrate the result by sketching the set of feasible solutions. 
16. Solve the LP problem


minimize —x,; + x2 
subject to —x, + 2x, <2 


x <3, x20 


both graphically and by the active set method. Correlate the stages of the active 
set method with the extreme points of the feasible region. 

17. Consider the LP problem (8.3.1) with $E$ empty, and compare its solution by the
active set method as against the simplex method. By introducing slack variables
$z$, write the constraints in the simplex method as

$$[-A^T : I]\begin{pmatrix}x\\ z\end{pmatrix} = -b$$

and $z \ge 0$. Use a modification of the simplex method in which the variables $x$ are
allowed to be unrestricted (for example as in Question 8 with $u_i = -l_i = K$ for
sufficiently large $K$). Choose the nonbasic variables as the slack variables $z_i$,
$i \in \mathcal{A}$, where $\mathcal{A}$ is the active set in the active set method. Partition
$A = [A_1 : A_2]$, $b^T = (b_1^T, b_2^T)$, and $z^T = (z_1^T, z_2^T)$, corresponding to active and
inactive constraints at $x^{(k)}$ in the latter method. The quantities in the active
set method are described in Section 8.3. In the simplex method, show that
$\hat b^T = (x^{(k)T}, z_2^{(k)T})$, $\pi^T = (-c^TA_1^{-T}, 0^T)$, and $\hat c_N = A_1^{-1}c = \lambda$ in (8.3.3) and
hence conclude that the same index $q$ is chosen. Hence show that the vector
$d$ is given by $d^T = (s_q^T : s_q^TA_2)$ (see (8.6.6)). Since there are no restrictions on
the $x$ variables, show that test (8.2.11) becomes

$$\min_{\substack{i :\, i \notin \mathcal{A}\\ a_i^Ts_q < 0}} \frac{a_i^Tx^{(k)} - b_i}{-a_i^Ts_q}$$

which is (8.3.6). Hence conclude that the same index $p$ is chosen, and therefore
that the sequence of values $x^{(k)}$ and $z^{(k)}$ in both methods is identical.

18. Use the method of artificial variables to find a b.f.s. for the LP problem in
Question 1.

19. Find a basic feasible solution for the LP problem


AUNIMIZE —X_ + Xo +i Xs — 2X4 
subject to 4x; +x2—2x3+3xq,g< 8 
xy a DEY Gy Naea =; x20 


by the method of artificial variables, and hence solve the problem. Is the 
inequality constraint active at the solution? 
20. Convert the LP problem


minimize X,;+ x2+2x3 

subject toy 74 X5it x5 9~ 
2X4 = 3X + 3x3 =1 

> 


—3x, + 6x, —4x3 = x20 


to standard form, and solve by a combined phase I—phase II simplex method, 
using the tableau form. 
21. Convert the LP problem


minimize —x; —-xX 2 


subject to x, +2x3 <1 
32) = oa S|! 
x4 PSD) 30 x3 =2, x20 


to standard form and find an initial b.f.s. by the method of artificial variables. 
Show that an artificial variable remains in the resulting basis at zero value. Why 
is this so? Show that the solution to the LP problem can nonetheless be 
obtained. 

22. Convert the LP problem


minimize 2x, + 4x. +3x3 
subject to x;—- X2- x38 
204 + x42 > 


to standard form. Solve using the simplex method 

(i) starting with a basic feasible solution for which x; = Ae = ONG Sek, 
(ii) using (8.4.5) with a basis of slack variables. 

23. Compute $LL^T$ factors of the tridiagonal matrix $A$ in which $A_{11} = 1$ and $A_{ii} = 2$
and $A_{i,i-1} = A_{i-1,i} = -1\ \forall\, i > 1$. Observe that no sparsity is lost in the
factors, but that the inverse matrix $L^{-1}$ is always a full lower triangular matrix
with a simple general form.

24. Consider the Gauss-Jordan factors (8.5.4) of a matrix $A$. Express each $M_i$ as
$M_i = U_iL_i$ where


1 1 —d,|d; 
hes = = «ek 
L; = : o U; = : if ’ 
Oe 1/d; ] 
=d,,|d; 1 1 
L_ column I io column i 


and show that L;U; = U; L; for j < i. Hence rearrange (8.5.4) to give 


or 
ah ee a 105 

By comparing this with A = LU where L and U are given in (8.5.5), show that 

the spike in L; is the ith column of L, but that the same is not true for U, since 


the spikes in U; in (8.5.7) occur horizontally. However observe from the last 
equation that 


U-} =U...) ---U;, 


in which the spikes in the U; are the columns of U~!. This justifies the assertion 
about the inefficiency of the Gauss—Jordan product form in Section 8.5. 
25. Consider the LP problem due to Beale:

minimize    $-\tfrac34x_1 + 20x_2 - \tfrac12x_3 + 6x_4$
subject to  $\tfrac14x_1 - 8x_2 - x_3 + 9x_4 \le 0$
            $\tfrac12x_1 - 12x_2 - \tfrac12x_3 + 3x_4 \le 0$
            $x_3 \le 1, \quad x \ge 0.$

Add slack variables $x_5$, $x_6$, $x_7$ and show that the initial choice $B = \{5, 6, 7\}$ is
feasible but non-optimal and degenerate. Solve the problem by the tableau
method, resolving ties in (8.2.11) by choosing the first permitted basic variable
in the tableau. Show that after six iterations the original tableau is restored so
that cycling is established. (Hadley, 1962, gives the detailed tableaux for this
example.)

26. For the example of Question 25, show that initial column orderings in the
tableau $\{5, 6, 7, 1, 2, 3, 4\}$ and $\{1, 2, 3, 4, 5, 6, 7\}$ both have $\hat b^{(1)}(\epsilon) \succ 0$. In
the first case show that the tie break on iteration 1 is resolved by $p = 6$, which
is different from the choice $p = 5$ made in the cycling iteration. Hence show
that degeneracy is removed on the next iteration which solves the problem. In
the second case show that the cycling iteration is followed for the first two
iterations, but that the tie on iteration 3 is broken by $p = 2$ rather than by
$p = 1$. Again show that degeneracy is removed on the next iteration which
solves the problem. This example illustrates that different iteration sequences
which resolve degeneracy can be obtained by different column orderings in the
perturbation method.

27. Set up subproblem (8.6.8) as a means of resolving degeneracy in the example
of Question 25. Show that a solution is given by $\lambda_0 = \tfrac85$ and $\lambda = (0, \tfrac25)^T$.
Show that this corresponds to a solution in which $s_2 = s_4 = 0$ and find $s_1$
and $s_3$.


Chapter 9 
The Theory of Constrained Optimization 


9.1 Lagrange Multipliers 


There have been many contributions to the theory of constrained optimization. In 
this chapter a number of the most important results are developed; the presentation 
aims towards practicality and avoids undue generality. Perhaps the most important 
concept which needs to be understood is the way in which so-called Lagrange multi- 
pliers are introduced and this is the aim of this section. In order to make the under- 
lying structure clear, this is done in a semi-rigorous way, with a fully rigorous treat- 
ment following in the next section. In Volume 1 it can be appreciated that the 
concept of a stationary point (for which $g(x) = \nabla f(x) = 0$) is fundamental to the
subject of unconstrained optimization, and is a necessary condition for a local 
minimizer. Lagrange multipliers arise when similar necessary conditions are sought 
for the solution x* of the constrained minimization problem (7.1.1). 

For unconstrained minimization in Volume 1, the necessary conditions illustrate
the requirement for zero slope and positive curvature in any direction at $x^*$, that is
to say there is no descent direction at $x^*$. In constrained optimization there is the
additional complication of a feasible region. Hence a local minimizer must be a
feasible point, and other necessary conditions illustrate the need for there to be no
feasible descent directions at $x^*$. However there are some difficulties on account of
the fact that the boundary of the feasible region may be curved. In the first instance
the simplest case of only equality constraints is presented (that is $I = \emptyset$ (the empty
set)).

Suppose that a feasible incremental step $\delta$ is taken from the minimizing point
$x^*$. By a Taylor series

$$c_i(x^* + \delta) = c_i^* + \delta^Ta_i^* + o(\|\delta\|)$$

where $a_i = \nabla c_i$ and $o(\cdot)$ indicates terms which can be ignored relative to $\delta$ in the
limit. By feasibility $c_i(x^* + \delta) = c_i^* = 0$, so that any feasible incremental step lies
along a feasible direction $s$ which satisfies

$$s^Ta_i^* = 0, \qquad \forall\, i \in E. \tag{9.1.1}$$


In a regular situation (for example if the vectors $a_i^*$, $i \in E$, are independent), it is
also possible to construct a feasible incremental step $\delta$, given any such $s$ (see Section
9.2). If in addition $f(x)$ has negative slope along $s$, that is

$$s^Tg^* < 0 \tag{9.1.2}$$

then the direction $s$ is a descent direction and the feasible incremental step along $s$
reduces $f(x)$. This cannot happen, however, because $x^*$ is a local minimizer.
Therefore there can be no direction $s$ which satisfies both (9.1.1) and (9.1.2). Now this
statement is clearly true if $g^*$ is a linear combination of the vectors $a_i^*$, $i \in E$, that
is if

$$g^* = \sum_{i \in E} a_i^*\lambda_i^* = A^*\lambda^* \tag{9.1.3}$$

and in fact only if this condition holds, as is shown below. Therefore (9.1.3) is a
necessary condition for a local minimizer. The multipliers $\lambda_i^*$ in this linear
combination are referred to as Lagrange multipliers, and the superscript $*$ indicates that
they are associated with the point $x^*$. $A^*$ denotes the matrix with columns $a_i^*$,
$i \in E$. Notice that there is a multiplier associated with each constraint function.
In fact if $A^*$ has full rank, then from (9.1.3) $\lambda^*$ is defined uniquely by the
expression

$$\lambda^* = A^{*+}g^*$$

where $A^+ = (A^TA)^{-1}A^T$ denotes the generalized inverse of $A$ (see Question 9.15).
Of course when there are no constraints present, then (9.1.3) reduces to the usual
stationary point condition that $g^* = 0$.

The formal proof that (9.1.3) is necessary is by contradiction. If (9.1.3) does
not hold, then $g^*$ can be expressed as

$$g^* = A^*\lambda + \mu \tag{9.1.4}$$

where $\mu \ne 0$ is the component of $g^*$ orthogonal to the vectors $a_i^*$, so that $A^{*T}\mu = 0$.
Then $s = -\mu$ satisfies both (9.1.1) and (9.1.2). Hence by the regularity assumption
there exists a feasible incremental step $\delta$ along $s$ which reduces $f(x)$, and this
contradicts the fact that $x^*$ is a local minimizer. Thus (9.1.3) is necessary. The conditions
are illustrated in Figure 9.1.1. At $x'$ which is not a local minimizer, $g' \ne a'\lambda$ and so
there exists a non-zero vector $\mu$ which is orthogonal to $a'$, and an incremental step
$\delta$ along the feasible descent direction $-\mu$ reduces $f(x)$. At $x^*$, $g^* = a^*\lambda^*$ and no
feasible descent direction exists. A numerical example is provided by the problem:
minimize $f(x) \triangleq x_1 + x_2$ subject to $c(x) \triangleq x_1^2 - x_2 = 0$. In this case $x^* = (-\tfrac12, \tfrac14)^T$
so that $g^* = (1, 1)^T$ and $a^* = (-1, -1)^T$, and (9.1.3) is thus satisfied with $\lambda^* = -1$.
The regularity assumption clearly holds because $a^*$ is independent; more discussion
about the regularity condition is given in the next section. The use of incremental
steps in the above can be expressed more rigorously by introducing directional
sequences, as in the next section, or by means of differentiable arcs (Kuhn and
Tucker, 1951). However the aim of this section is to avoid these technicalities as
much as possible.

Figure 9.1.1 Existence of Lagrange multipliers (showing the contours of $f(x)$ and the constraint $c(x) = 0$)

These conditions give rise to the classical method of Lagrange multipliers for
solving equality constraint problems. The method is to find vectors $x^*$ and $\lambda^*$
which solve the equations

$$\begin{aligned} g(x) &= \sum_{i \in E} a_i(x)\lambda_i\\ c_i(x) &= 0, \qquad i \in E\end{aligned} \tag{9.1.5}$$

which arise from (9.1.3) and feasibility. If there are $m$ equality constraints then
there are $n + m$ equations and $n + m$ unknowns $x$ and $\lambda$, so the system is well
determined. However the system is non-linear (in $x$) so in general may not be easy to
solve (see Volume 1), although this may be possible in simple cases. An additional
objection is that no second order information is taken into account, so (9.1.5) is
also satisfied at a constrained saddle point or maximizer. The example of the
previous paragraph can be used to illustrate the method. There are then three equations
in (9.1.5), that is


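$$\begin{pmatrix}1\\ 1\end{pmatrix} = \begin{pmatrix}2x_1\\ -1\end{pmatrix}\lambda, \qquad x_1^2 - x_2 = 0.$$

These can be solved in turn for the three variables $\lambda^* = -1$, $x_1^* = -\tfrac12$, and $x_2^* = \tfrac14$. It
is instructive to see how the method differs from that of direct elimination. In this
case $x_2 = x_1^2$ is used to eliminate $x_2$ from $f(x)$ leaving $f(x_1) = x_1 + x_1^2$ which is
minimized by $x_1^* = -\tfrac12$. Then back substitution gives $x_2^* = \tfrac14$.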
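
When the equations (9.1.5) cannot be solved by inspection, they can be treated as a general non-linear system. The following sketch applies Newton's method to the three equations of the example above; the starting guess, the iteration count, and the helper names are arbitrary assumptions made for illustration, and the small linear solver is included only to keep the example self-contained.

```python
def solve3(M, r):
    """Solve the 3x3 system M y = r by Gaussian elimination with pivoting."""
    n = 3
    A = [row[:] + [ri] for row, ri in zip(M, r)]
    for k in range(n):
        piv = max(range(k, n), key=lambda i: abs(A[i][k]))
        A[k], A[piv] = A[piv], A[k]
        for i in range(k + 1, n):
            m = A[i][k] / A[k][k]
            A[i] = [a - m * b for a, b in zip(A[i], A[k])]
    y = [0.0] * n
    for i in reversed(range(n)):
        y[i] = (A[i][n] - sum(A[i][j] * y[j] for j in range(i + 1, n))) / A[i][i]
    return y

def residual(x1, x2, lam):
    # Equations (9.1.5) for f = x1 + x2 and c = x1^2 - x2.
    return [1 - 2 * x1 * lam, 1 + lam, x1 * x1 - x2]

def jacobian(x1, x2, lam):
    # Derivatives of the residual with respect to (x1, x2, lam).
    return [[-2 * lam, 0.0, -2 * x1],
            [0.0, 0.0, 1.0],
            [2 * x1, -1.0, 0.0]]

x1, x2, lam = -1.0, 1.0, -2.0          # arbitrary starting guess
for _ in range(20):
    rhs = [-r for r in residual(x1, x2, lam)]
    step = solve3(jacobian(x1, x2, lam), rhs)
    x1, x2, lam = x1 + step[0], x2 + step[1], lam + step[2]

print(x1, x2, lam)
```

From this starting point the iteration reproduces the values $x_1^* = -\tfrac12$, $x_2^* = \tfrac14$, and $\lambda^* = -1$ found above. As the text notes, a solution of (9.1.5) found in this way could equally be a constrained saddle point or maximizer, so second order information would still be needed in general.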

It is often convenient to restate these results by introducing the Lagrangian
function

$$\mathcal{L}(x, \lambda) = f(x) - \sum_i \lambda_ic_i(x). \tag{9.1.6}$$

Then (9.1.5) becomes the very simple expression

$$\tilde\nabla\mathcal{L}(x^*, \lambda^*) = 0 \tag{9.1.7}$$

where

$$\tilde\nabla = \begin{pmatrix}\nabla_x\\ \nabla_\lambda\end{pmatrix} \tag{9.1.8}$$

is a first derivative operator for the $n + m$ variable space. Hence a necessary
condition for a local minimizer is that $x^*$, $\lambda^*$ is a stationary point of the Lagrangian
function.

An alternative way of deriving these results starts by trying to find a stationary
point of $f(x)$ subject to $c(x) = 0$, that is a feasible point $x^*$ at which $f(x^* + \delta) =
f(x^*) + o(\|\delta\|)$ for all feasible changes $\delta$. The method of Lagrange multipliers finds
a stationary point of $\mathcal{L}(x, \lambda)$ by solving (9.1.7) or equivalently (9.1.5). Since
$\nabla_\lambda\mathcal{L} = 0$ it follows that $x^*$ is feasible. Since $\nabla_x\mathcal{L}(x^*, \lambda^*) = 0$ it follows that
$\mathcal{L}(x, \lambda^*)$ is stationary at $x^*$ for all changes $\delta$, and hence for all feasible changes.
But if $\delta$ is a feasible change, $\mathcal{L}(x^* + \delta, \lambda^*) = f(x^* + \delta)$ and so $f(x)$ is stationary at
$x^*$ for all feasible changes $\delta$. Thus if (9.1.7) can be solved, this solution $x^*$ is a
constrained stationary point of $f(x)$. The opposite may not always be true; it is possible
(for example Questions 7.4 and 9.14) that $x^*$ can be a minimizer but not satisfy
(9.1.7). Also the examples with f(x) 4 +(x — 1)° + y and the same constraint show 
that x* = (0, 1)" can be a constrained stationary point but not satisfy (9.1.7). 

To get another insight into the meaning of Lagrange multipliers, consider what 
happens if the right hand sides of the constraints are perturbed, so that 


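$$c_i(x) = \varepsilon_i, \quad i \in E. \tag{9.1.9}$$


Let $x(\varepsilon)$, $\lambda(\varepsilon)$ denote how the solution and multipliers change as $\varepsilon$ changes. The Lagrangian for this problem is


$$\mathcal{L}(x, \lambda, \varepsilon) = f(x) - \sum_i \lambda_i (c_i(x) - \varepsilon_i).$$


From (9.1.9), $f(x(\varepsilon)) = \mathcal{L}(x(\varepsilon), \lambda(\varepsilon), \varepsilon)$, so using the chain rule and then (9.1.7) it follows that


$$\frac{df}{d\varepsilon_i} = \frac{d\mathcal{L}}{d\varepsilon_i} = \frac{\partial x^T}{\partial \varepsilon_i} \nabla_x \mathcal{L} + \frac{\partial \lambda^T}{\partial \varepsilon_i} \nabla_\lambda \mathcal{L} + \frac{\partial \mathcal{L}}{\partial \varepsilon_i} = \lambda_i. \tag{9.1.10}$$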


Thus the Lagrange multiplier of any constraint measures the rate of change in the 
objective function, consequent upon changes in that constraint function. This 
information can be valuable in that it indicates how sensitive the objective function 
is to changes in the different constraints: see Question 9.4 for example. 
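
This sensitivity interpretation can be checked numerically on a small case. The sketch below assumes the earlier equality constrained example is minimize $x_1 + x_2$ subject to $x_1^2 - x_2 = \varepsilon$ (an assumption consistent with the multiplier $\lambda^* = -1$ quoted for it); `optimal_value` is a hypothetical helper that uses direct elimination, and the derivative is estimated by a central difference.

```python
def optimal_value(eps):
    """Optimal objective value of the perturbed problem, by elimination.

    Substituting x2 = x1**2 - eps gives x1 + x1**2 - eps, which is
    minimized at x1 = -1/2 whatever the value of eps.
    """
    x1 = -0.5
    x2 = x1**2 - eps
    return x1 + x2

h = 1e-6
# Central difference estimate of d f* / d eps at eps = 0.
slope = (optimal_value(h) - optimal_value(-h)) / (2.0 * h)
lam_star = -1.0   # multiplier obtained from the Lagrange equations
print(slope, lam_star)
```

The estimated rate of change of the optimal objective value agrees with the multiplier, as (9.1.10) predicts.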

The additional complication of having inequality constraints present is now discussed. It is important to realize that only the active constraints $\mathcal{A}^*$ (see (7.1.2)) at $x^*$ can influence matters. Denote the active inequality constraints at $x^*$ by $I^*$ ($= \mathcal{A}^* \cap I$). Since $c_i^* = 0$ and $c_i(x^* + \delta) \ge 0$ for $i \in I^*$, then any feasible incremental step $\delta$ lies along a feasible direction $s$ which satisfies


$$s^T a_i^* \ge 0, \quad \forall i \in I^* \tag{9.1.11}$$


in addition to (9.1.1). In this case it is clear that there is no direction $s$ which satisfies (9.1.1), (9.1.11), and (9.1.2) together, if both

$$g^* = \sum_{i \in \mathcal{A}^*} a_i^* \lambda_i^* \tag{9.1.12}$$

and

$$\lambda_i^* \ge 0, \quad i \in I^* \tag{9.1.13}$$


hold, and in fact only if these conditions hold, as is shown below. Hence these are necessary conditions for a local minimizer. The only extra condition beyond (9.1.3) in this case, therefore, is that the multipliers of active inequality constraints must be non-negative. These conditions can be established by contradiction when the regularity assumption is made that the set of normal vectors $a_i^*$, $i \in \mathcal{A}^*$, is independent. Equation (9.1.12) holds as for (9.1.3). Let (9.1.13) not hold so that there is some multiplier $\lambda_p^* < 0$. It is always possible to find a direction $s$ for which $s^T a_i^* = 0$, $i \in \mathcal{A}^*$, $i \ne p$, and $s^T a_p^* = 1$ (for instance $s = A^{*+T} e_p$ where $A^+ = (A^T A)^{-1} A^T$ denotes the generalized inverse of $A$). Then $s$ is feasible in (9.1.1) and (9.1.11), and from (9.1.12) it follows that


$$s^T g^* = s^T a_p^* \lambda_p^* = \lambda_p^* < 0. \tag{9.1.14}$$


Thus $s$ is also downhill, and by the regularity assumption there exists a feasible incremental step $\delta$ along $s$ which reduces $f(x)$; this contradicts the fact that $x^*$ is a local minimizer. Thus (9.1.13) is necessary. Note that this proof uses the independence of the vectors $a_i^*$, $i \in \mathcal{A}^*$, in constructing the generalized inverse. A more general proof using Lemma 9.2.4 (Farkas' lemma) and its corollary does not make this assumption.
The need for condition (9.1.13) can also be deduced simply from (9.1.9) and (9.1.10). Let an active inequality constraint be perturbed from $c_i(x) = 0$ to $c_i(x) = \varepsilon_i > 0$, $i \in I^*$. This induces a feasible change in $x(\varepsilon)$ so it is necessary that $f(x(\varepsilon))$ does not decrease. This implies that $df^*/d\varepsilon_i \ge 0$ at the solution and hence $\lambda_i^* \ge 0$. Thus the necessity of (9.1.13) has an obvious interpretation in these terms.
As an example of these conditions, consider the problem 


$$\begin{aligned} \text{minimize} \quad & f(x) \triangleq -x_1 - x_2 \\ \text{subject to} \quad & c_1(x) \triangleq x_2 - x_1^2 \ge 0 \\ & c_2(x) \triangleq 1 - x_1^2 - x_2^2 \ge 0 \end{aligned} \tag{9.1.15}$$

which is illustrated in Figure 7.1.3. The solution is $x^* = (1/\sqrt{2}, 1/\sqrt{2})^T$ so $c_1$ is not active and hence $\mathcal{A}^* = \{2\}$. Then $g^* = (-1, -1)^T$ and $a_2^* = (-\sqrt{2}, -\sqrt{2})^T$ so (9.1.12) and (9.1.13) are satisfied with $\lambda_2^* = 1/\sqrt{2} > 0$. It is important to remember that a general inequality constraint must be correctly rearranged into the form $c_i(x) \ge 0$ before the condition $\lambda_i \ge 0$ applies.

In fact the construction of a descent direction when $\lambda_p^* < 0$ above indicates another important property of Lagrange multipliers for inequality constraints. The conditions $s^T a_i^* = 0$, $i \ne p$, and $s^T a_p^* = 1$ indicate that the resulting feasible incremental step satisfies $c_i(x^* + \delta) = 0$ for $i \ne p$, and $c_p(x^* + \delta) > 0$. Thus it indicates that $f(x)$ can be reduced by moving away from the boundary of constraint $p$. This result also follows from (9.1.10) and is of great importance in the various active set methods (see Section 7.2) for handling inequality constraints. If conditions (9.1.13) are not satisfied then a constraint index $p$ with $\lambda_p < 0$ can be removed from the active set. This result is also illustrated by the problem in the previous paragraph. Consider the feasible point $x' \approx (0.786, 0.618)^T$ to three decimal places, at which both constraints are active. Since $g' = (-1, -1)^T$, $a_1' = (-1.572, 1)^T$, and $a_2' = (-1.572, -1.236)^T$, it follows that (9.1.12) is satisfied with $\lambda' = (-0.096, 0.732)^T$. However (9.1.13) is not satisfied, so $x'$ is not a local minimizer. Since $\lambda_1' < 0$ the objective function can be reduced by moving away from the boundary of constraint 1, along a direction for which $s^T a_1' = 1$ and $s^T a_2' = 0$. This is the direction $s = (-0.352, 0.447)^T$ and is in fact the tangent to the circle at $x'$. Moving round the arc of the circle in this direction leads to the solution point $x^*$ at which only constraint 2 is active.
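
The figures quoted for the point $x'$ can be reproduced with a few lines of linear algebra. This is a sketch under the assumption that problem (9.1.15) is as stated above; the variable names are illustrative. The multipliers are obtained from $g' = A'\lambda'$, and the direction from the generalized inverse construction $s = A^{+T} e_p$.

```python
import numpy as np

# Point x' where both constraints are active:
# x2 = x1**2 and x1**2 + x2**2 = 1 give x1**2 = (sqrt(5) - 1) / 2.
x1 = np.sqrt((np.sqrt(5.0) - 1.0) / 2.0)
x2 = x1**2

g = np.array([-1.0, -1.0])                  # gradient of f = -x1 - x2
A = np.array([[-2.0 * x1, -2.0 * x1],       # columns are a1' and a2'
              [1.0,       -2.0 * x2]])

lam = np.linalg.solve(A, g)                 # multipliers in g = A lam
p = int(np.argmin(lam))                     # constraint with negative multiplier
e_p = np.zeros(2)
e_p[p] = 1.0
s = np.linalg.pinv(A).T @ e_p               # s = A^{+T} e_p, so that A^T s = e_p

print(np.round(lam, 3), np.round(s, 3), s @ g)
```

The slope $s^T g'$ equals the negative multiplier, in agreement with (9.1.14), so the step that releases constraint 1 is downhill.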

A further restatement of (9.1.12) and (9.1.13) is possible in terms of all the 
constraints rather than just the active ones. It is consistent to regard any inactive 
constraint as having a zero Lagrange multiplier, in which case (9.1.12), (9.1.13), 
and the feasibility conditions can be combined in the following theorem. 


Theorem 9.1.1 (First order necessary conditions) 


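If $x^*$ is a local minimizer of problem (7.1.1) and if a regularity assumption (9.2.4) holds at $x^*$, then there exist Lagrange multipliers $\lambda^*$ such that $x^*$, $\lambda^*$ satisfy the following system:


$$\begin{aligned} \nabla_x \mathcal{L}(x, \lambda) &= 0 \\ c_i(x) &= 0, \quad i \in E \\ c_i(x) &\ge 0, \quad i \in I \\ \lambda_i &\ge 0, \quad i \in I \\ \lambda_i c_i(x) &= 0 \quad \forall i. \end{aligned} \tag{9.1.16}$$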


These are often described as Kuhn-Tucker (KT) conditions (Kuhn and Tucker, 1951) and a point $x^*$ which satisfies the conditions is sometimes referred to as a KT point. The regularity assumption (9.2.4) is implied by the vectors $a_i^*$, $i \in \mathcal{A}^*$, being independent and is discussed in detail in the next section where a more rigorous proof is given. The final condition $\lambda_i^* c_i^* = 0$ is referred to as the complementarity condition and states that both $\lambda_i^*$ and $c_i^*$ cannot be non-zero, or equivalently that inactive constraints have a zero multiplier. If there is no $i$ such that $\lambda_i^* = c_i^* = 0$ then strict complementarity is said to hold. The case $\lambda_i^* = c_i^* = 0$ is an intermediate state between a constraint being strongly active and being inactive, as indicated in Figure 9.1.2.

Strongly active: $\lambda^* > 0$, $c^* = 0$
Weakly active: $\lambda^* = c^* = 0$
Inactive: $\lambda^* = 0$, $c^* > 0$

Figure 9.1.2 Complementarity
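
The KT system (9.1.16) is easy to test at a candidate point. The sketch below does this for problem (9.1.15), which has no equality constraints; `kt_check` and its tolerance are hypothetical, and the multiplier values at the second point are rounded figures rather than exact ones.

```python
import numpy as np

def kt_check(x, lam, tol=1e-10):
    """Return True if (x, lam) satisfies (9.1.16) for problem (9.1.15)."""
    g = np.array([-1.0, -1.0])                       # gradient of f
    c = np.array([x[1] - x[0]**2,                    # c1(x)
                  1.0 - x[0]**2 - x[1]**2])          # c2(x)
    A = np.array([[-2.0 * x[0], -2.0 * x[0]],        # columns: grad c1, grad c2
                  [1.0,         -2.0 * x[1]]])
    stationary = np.linalg.norm(g - A @ lam) <= tol  # grad_x L = 0
    feasible = np.all(c >= -tol)                     # c_i >= 0
    nonneg = np.all(lam >= -tol)                     # lam_i >= 0
    complementary = np.all(np.abs(lam * c) <= tol)   # lam_i c_i = 0
    return bool(stationary and feasible and nonneg and complementary)

r = 1.0 / np.sqrt(2.0)
at_solution = kt_check(np.array([r, r]), np.array([0.0, r]))

# The point where both constraints are active has a negative multiplier.
x1 = np.sqrt((np.sqrt(5.0) - 1.0) / 2.0)
at_corner = kt_check(np.array([x1, x1**2]), np.array([-0.0956, 0.7316]))
print(at_solution, at_corner)
```

At the solution the inactive constraint $c_1$ carries a zero multiplier, so strict complementarity holds there.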


So far only first order (that is first derivative) conditions have been considered. It 
is also possible to state second order conditions which give information about the 
curvature of the objective and constraint functions at a local minimizer. This subject 
is discussed in Section 9.3. It is also possible to make even stronger statements when 
the original problem is a convex programming problem, and the more simple results 
of convexity and its application to optimization theory are developed in Section 
9.4. For certain convex programming problems it is possible to state useful alternative (dual) problems from which the solution to the original (primal) problem can be obtained. These problems involve the Lagrange multipliers as dual variables, much in the way that they arise in the method of Lagrange multipliers. The subject of duality is discussed further in Section 9.5. In fact the literature on convexity and duality is very extensive and often very theoretical, so as to become a
branch of pure mathematics. In this volume I have attempted to describe those 
aspects of these subjects which are of most relevance to practical algorithms. 


9.2 First Order Conditions 


In this section the results of the previous section are considered in more technical detail. First of all it is important to have a more rigorous notion of what is meant by a feasible incremental step. Consider any feasible point $x'$ and any infinite sequence of feasible points $\{x^{(k)}\} \to x'$ where $x^{(k)} \ne x'$ for all $k$. Then it is possible to write


$$x^{(k)} - x' = \delta^{(k)} s^{(k)} \quad \forall k \tag{9.2.1}$$


where $\delta^{(k)} > 0$ is a scalar and $s^{(k)}$ is a vector of any fixed length $\sigma > 0$ ($\|s^{(k)}\|_2 = \sigma$). It follows that $\delta^{(k)} \to 0$. A directional sequence can be defined as any such sequence for which $s^{(k)} \to s$. The limiting vector $s$ is referred to as a feasible direction, and $\mathcal{F}(x')$ or $\mathcal{F}'$ is used to denote the set of feasible directions at $x'$. Taking the limit in (9.2.1) corresponds to the notion of making a feasible incremental step along $s$. Clearly the length of $s$ is arbitrary and it is possible to restrict the discussion to any fixed normalization, for example $\sigma = 1$. In some texts $\mathcal{F}'$ is defined so that it also includes the zero vector, and is then referred to as the tangent cone at $x'$.
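
A directional sequence can be made concrete with a small numerical sketch. The constraint $c(x) = x_1^2 - x_2 = 0$ and the point $x' = 0$ used here are assumptions chosen for illustration, as is the sequence $x^{(k)} = (1/k, 1/k^2)^T$; the normalization is $\sigma = 1$.

```python
import numpy as np

x_prime = np.array([0.0, 0.0])
a_prime = np.array([0.0, -1.0])   # gradient of c(x) = x1**2 - x2 at x'

def direction(k):
    """Return (delta_k, s_k) in (9.2.1) for the feasible point x_k."""
    t = 1.0 / k
    x_k = np.array([t, t**2])     # lies on the curve x2 = x1**2
    step = x_k - x_prime
    delta_k = np.linalg.norm(step)
    return delta_k, step / delta_k

for k in (1, 10, 100):
    print(k, direction(k))

delta, s = direction(10**6)       # far along the sequence
print(delta, s, s @ a_prime)
```

As $k$ grows, $\delta^{(k)} \to 0$ and $s^{(k)}$ approaches the tangent direction $(1, 0)^T$, for which $s^T a' = 0$.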

The set $\mathcal{F}'$ is not very amenable to manipulation, however, and it is convenient to consider a related set of feasible directions which are obtained if the constraints are linearized. The linearized constraint function is given by (7.1.3) so clearly the set of feasible directions for the linearized constraint set can be written as

$$F(x') = F' = \{ s \mid s \ne 0, \quad s^T a_i' = 0, \ i \in E, \quad s^T a_i' \ge 0, \ i \in I' \} \tag{9.2.2}$$

where $I' = \mathcal{A}' \cap I$ denotes the set of active inequality constraints at $x'$. (A vector $s \in F'$ corresponds to a directional sequence along a half-line through $x'$ in the direction $s$, which is clearly feasible, and it is straightforward to contradict feasibility in the linearized set for any directional sequence for which $s \notin F'$.) It is very convenient if the sets $F'$ and $\mathcal{F}'$ are the same, so it is important to consider the extent to which this is true.


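Lemma 9.2.1

$F' \supseteq \mathcal{F}'$

Proof


Let $s \in \mathcal{F}'$: then there exists a directional sequence $x^{(k)} \to x'$ such that $s^{(k)} \to s$. A Taylor series about $x'$ using (9.2.1) gives


$$c_i(x^{(k)}) = c_i' + \delta^{(k)} s^{(k)T} a_i' + o(\delta^{(k)}).$$

Now $c_i(x^{(k)}) = c_i' = 0$ for $i \in E$ and $c_i(x^{(k)}) \ge c_i' = 0$ for $i \in I'$, so dividing by $\delta^{(k)} > 0$ it follows that

$$\begin{aligned} s^{(k)T} a_i' + o(1) &= 0, \quad i \in E \\ s^{(k)T} a_i' + o(1) &\ge 0, \quad i \in I'. \end{aligned}$$

Taking limits as $k \to \infty$, $s^{(k)} \to s$, $o(1) \to 0$, then $s \in F'$ from (9.2.2). □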
Unfortunately a result going the other way ($\mathcal{F}' \supseteq F'$) is not in general true and it is this that is required in the proof of Theorem 9.1.1. To get round this difficulty, Kuhn and Tucker (1951) make an assumption that $F' = \mathcal{F}'$, which they refer to as a constraint qualification at $x'$. That is to say, for any $s \in F'$, the existence of a feasible directional sequence with feasible direction $s$ is assumed. They also give an example in which this property does not hold which is essentially that illustrated in Figure 9.2.1. Clearly at $x' = 0$, the direction $s = (-1, 0)^T$ is in $F'$ but there is no corresponding feasible directional sequence, and so $s \notin \mathcal{F}'$. However it is important to realize that this failing case is an unlikely situation and that it is usually valid to assume that $F' = \mathcal{F}'$. Indeed this result can be guaranteed if certain linearity or independence conditions hold, as the following result shows.

Figure 9.2.1 Failure of constraint qualification


Lemma 9.2.2

Sufficient conditions for $F' = \mathcal{F}'$ at a feasible point $x'$ are either


(i) the constraints $i \in \mathcal{A}'$ are all linear constraints or
(ii) the vectors $a_i'$, $i \in \mathcal{A}'$, are linearly independent.


Proof 


Case (i) is clear by definition of $F'$. For case (ii) a feasible directional sequence with feasible direction $s$ is constructed for any $s \in F'$. Let $s \in F'$, and consider the non-linear system


$$r(x, \theta) = 0 \tag{9.2.3}$$

defined by

$$\begin{aligned} r_i(x, \theta) &= c_i(x) - \theta s^T a_i', & i &= 1, 2, \ldots, m \\ r_i(x, \theta) &= (x - x')^T b_i - \theta s^T b_i, & i &= m + 1, \ldots, n \end{aligned}$$


where it is assumed that $\mathcal{A}' = \{1, 2, \ldots, m\}$. The system (9.2.3) is solved by $x'$ when $\theta = 0$, and any solution $x$ is also a feasible point in (7.1.1) when $\theta \ge 0$ is sufficiently small. Writing $A = [a_1, \ldots, a_m]$ and $B = [b_{m+1}, \ldots, b_n]$ then the Jacobian matrix $J(x, \theta) = \nabla_x r^T(x, \theta) = [A : B]$. Since $A'$ has full rank by case (ii), it is possible to choose $B$ so that $J'$ is non-singular. Hence by the implicit function theorem (Apostol (1957) for example) there exist open neighbourhoods $\Omega_x$ about $x'$ and $\Omega_\theta$ about $\theta = 0$ such that for any $\theta \in \Omega_\theta$, a unique solution $x(\theta) \in \Omega_x$ exists to (9.2.3), and $x(\theta)$ is a $C^1$ function of $\theta$. From (9.2.3) and using the chain rule

$$0 = \frac{dr_i}{d\theta} = \sum_j \frac{\partial r_i}{\partial x_j} \frac{dx_j}{d\theta} + \frac{\partial r_i}{\partial \theta}$$

so that

$$0 = J^T \frac{dx}{d\theta} - J'^T s.$$

Thus $dx/d\theta = s$ at $\theta = 0$. Hence if $\theta^{(k)} \downarrow 0$ is any sequence then $x^{(k)} = x(\theta^{(k)})$ is a feasible directional sequence with feasible direction $s$. □

In moving on to discuss necessary conditions at a local solution it is convenient to define the set of descent directions


$$\mathcal{D}(x') = \mathcal{D}' = \{ s \mid s^T g' < 0 \}.$$


Then the most basic necessary condition is the following.


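Lemma 9.2.3


If $x^*$ is a local minimizer, then $\mathcal{F}^* \cap \mathcal{D}^* = \emptyset$ (no feasible descent directions).


Proof


Let $s \in \mathcal{F}^*$ so there exists a feasible sequence $x^{(k)} \to x^*$ such that $s^{(k)} \to s$. By a Taylor series about $x^*$,


$$f(x^{(k)}) = f^* + \delta^{(k)} s^{(k)T} g^* + o(\delta^{(k)}).$$


Because $x^*$ is a local minimizer, $f(x^{(k)}) \ge f^*$ for all $k$ sufficiently large, so dividing by $\delta^{(k)} > 0$,


$$s^{(k)T} g^* + o(1) \ge 0.$$


In the limit, $s^{(k)} \to s$, $o(1) \to 0$ so $s \notin \mathcal{D}^*$ and hence $\mathcal{F}^* \cap \mathcal{D}^* = \emptyset$. □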


Unfortunately it is not possible to proceed further without making a regularity assumption


$$\mathcal{F}^* \cap \mathcal{D}^* = F^* \cap \mathcal{D}^*. \tag{9.2.4}$$


This assumption is clearly implied by the Kuhn-Tucker constraint qualification ($F^* = \mathcal{F}^*$) at $x^*$, but (9.2.4) may hold when $F^* = \mathcal{F}^*$ does not, for example at $x^* = 0$ in the problem: minimize $x_2$ subject to the constraints of Figure 9.2.1. Also the problem: minimize $x_1$ subject to the same constraints, illustrates the need for a regularity assumption. Here $s = (-1, 0)^T \in F^* \cap \mathcal{D}^*$ at $x^* = 0$ so this set is not empty and in fact $x^*$ is not a KT point, although it is a minimizer and $\mathcal{F}^* \cap \mathcal{D}^*$ is empty.

With assumption (9.2.4) the necessary condition from lemma 9.2.3 becomes $F^* \cap \mathcal{D}^* = \emptyset$ (no linearized feasible descent directions). It is now possible to relate this condition to the existence of multipliers in (9.1.12) and (9.1.13). In fact the construction given after (9.1.13) can be used to do this when the set $a_i^*$, $i \in \mathcal{A}^*$, is independent. However the following lemma shows that the result is more general than this.


Lemma 9.2.4 (Farkas' lemma)

Given any vectors $a_1, a_2, \ldots, a_m$ and $g$ then the set

$$S = \{ s \mid s^T g < 0, \tag{9.2.5}$$
$$\qquad s^T a_i \ge 0, \quad i = 1, 2, \ldots, m \} \tag{9.2.6}$$

is empty if and only if there exist multipliers $\lambda_i \ge 0$ such that


$$g = \sum_{i=1}^{m} a_i \lambda_i. \tag{9.2.7}$$
fan 


Remark 


In the context of this section, $S$ is the set of linearized feasible descent directions for a problem with no equality constraints. The extension to include equality constraints is made in the corollary.


Proof 


The 'if' part is straightforward since (9.2.7) implies that $s^T g = \sum_i s^T a_i \lambda_i \ge 0$ by (9.2.6) and $\lambda_i \ge 0$. Thus (9.2.5) is not true and so $S$ is empty. The converse result is established by showing that if (9.2.7) with $\lambda_i \ge 0$ does not hold, then there is a vector $s \in S$. Geometrically, the result is easy to visualize. The set of vectors


$$C = \Big\{ v \ \Big|\ v = \sum_{i=1}^{m} a_i \lambda_i, \quad \lambda_i \ge 0 \Big\}$$


is known as a polyhedral cone and is closed and convex (see Section 9.4). From Figure 9.2.2 it is clear that if $g \notin C$ then there exists a hyperplane with normal vector $s$ which separates $C$ and $g$, and for which $s^T a_i \ge 0$, $i = 1, 2, \ldots, m$, and $s^T g < 0$. Thus $s \in S$ exists and the lemma is proved. For completeness, the general proof of the existence of a separating hyperplane is given below in Lemma 9.2.5. □

Figure 9.2.2 Existence of a separating hyperplane (labels: polyhedral cone, separating hyperplane, $g \notin C$)

Corollary

The set

$$S = \{ s \mid s^T g^* < 0, \quad s^T a_i^* = 0, \ i \in E, \quad s^T a_i^* \ge 0, \ i \in I^* \}$$


is empty if and only if there exist multipliers $\lambda_i^*$ such that (9.1.12) and (9.1.13) hold.


Proof 


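Ignoring superscript $*$, then $s^T a_i = 0$, $i \in E$, can be written as $s^T a_i \ge 0$ and $-s^T a_i \ge 0$, $i \in E$. Then by Farkas' lemma this is equivalent to the fact that there exist non-negative multipliers $\lambda_i$, $i \in I^*$, and $\lambda_i^+$, $\lambda_i^-$, $i \in E$, such that


$$g = \sum_{i \in I^*} a_i \lambda_i + \sum_{i \in E} a_i \lambda_i^+ + \sum_{i \in E} -a_i \lambda_i^-.$$


But defining $\lambda_i = \lambda_i^+ - \lambda_i^-$, $i \in E$, gives (9.1.12) and (9.1.13) so the corollary is proved. □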


Lemma 9.2.5 (Existence of a separating hyperplane) 


There exists a hyperplane $s^T x = 0$ which separates a closed convex cone $C$ and a non-zero vector $g \notin C$.


Proof


By construction. Consider minimizing $\|g - x\|_2^2$ for all $x \in C$, and let $x_1 \in C$. Since the solution satisfies $\|g - x\|_2 \le \|g - x_1\|_2$ it is bounded and so by continuity of $\|\cdot\|_2$ a minimizing point, $\hat{g}$ say, exists. Because $\lambda \hat{g} \in C$ for all $\lambda \ge 0$, and because $\|\lambda \hat{g} - g\|_2^2$ has a minimum at $\lambda = 1$, it follows by setting $d/d\lambda = 0$ that


$$\hat{g}^T (\hat{g} - g) = 0. \tag{9.2.8}$$

Let $x \in C$; then $\hat{g} + \theta(x - \hat{g}) \in C$ for $\theta \in (0, 1)$ by convexity, and hence

$$\|\theta(x - \hat{g}) + \hat{g} - g\|_2^2 \ge \|\hat{g} - g\|_2^2.$$

Simplifying and taking the limit $\theta \downarrow 0$ it follows that

$$0 \le (x - \hat{g})^T (\hat{g} - g) = x^T (\hat{g} - g)$$

from (9.2.8). Thus the vector $s = \hat{g} - g$ is such that $s^T x \ge 0$ for all $x \in C$. But $g^T s = -s^T s$ from (9.2.8), and $s \ne 0$ since $g \notin C$. Thus $g^T s < 0$ and hence the hyperplane $s^T x = 0$ separates $C$ and $g$. □


Geometrically the vector $s$ is along the perpendicular from $g$ to $C$. In Figure 9.2.2 $\hat{g}$ would be a multiple of the vector $a_5$ and the resulting hyperplane (not the one illustrated) would touch the cone along $a_5$.
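
The construction in Lemma 9.2.5 can be tried numerically. The sketch below is an illustration under stated assumptions: it uses the vectors $g'$, $a_1'$, $a_2'$ at the point $x'$ of problem (9.1.15), where no non-negative multipliers exist, and it approximates the nearest point $\hat{g}$ in the cone by a simple projected gradient iteration (a hypothetical choice made for brevity, not a method taken from the text).

```python
import numpy as np

def project_onto_cone(A, g, iters=500):
    """Approximate the nearest point to g in C = {A lam : lam >= 0}.

    Projected gradient descent on 0.5 * ||A lam - g||^2 with step 1/L,
    where L is the largest eigenvalue of A^T A.
    """
    step = 1.0 / np.linalg.norm(A, 2) ** 2
    lam = np.zeros(A.shape[1])
    for _ in range(iters):
        grad = A.T @ (A @ lam - g)
        lam = np.maximum(0.0, lam - step * grad)
    return A @ lam, lam

x1 = np.sqrt((np.sqrt(5.0) - 1.0) / 2.0)
A = np.array([[-2.0 * x1, -2.0 * x1],      # columns a1', a2'
              [1.0,       -2.0 * x1**2]])
g = np.array([-1.0, -1.0])

g_hat, lam = project_onto_cone(A, g)
s = g_hat - g                               # normal of the separating hyperplane
print(lam, s, s @ g, A.T @ s)
```

Here $g \notin C$, so $s = \hat{g} - g$ is non-zero with $s^T a_i \ge 0$ for each $i$ and $s^T g < 0$: a linearized feasible descent direction, as Farkas' lemma requires when (9.2.7) fails.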

It is now possible to bring together the various aspects of this section in proving that the first order conditions (9.1.12) and (9.1.13) (or equivalently the conditions (9.1.16) of theorem 9.1.1) are necessary at a local minimizing point $x^*$. At $x^*$ there are no feasible descent directions ($\mathcal{F}^* \cap \mathcal{D}^* = \emptyset$) by lemma 9.2.3. A regularity assumption (9.2.4) is made that there are no linearized feasible descent directions. Then by the corollary to Farkas' lemma it follows that (9.1.12) and (9.1.13) hold. Thus these results have been established in a quite rigorous way.


9.3 Second Order Conditions 


A natural progression from the previous section is to examine the effect of second order (curvature) terms in the neighbourhood of a local solution. It can readily be seen for unconstrained optimization (Volume 1) that the resulting sufficient condition that $G^*$ is positive definite has significant implications for the design of satisfactory algorithms, and the same is true for constrained optimization. It is important to realize first of all that constraint curvature plays an important role, and that it is not possible to examine the curvature of $f(x)$ in isolation. For example realistic problems exist for which $G^*$ is positive definite at a Kuhn-Tucker point $x^*$, which is not, however, a local minimizer (see (9.3.6) below). As in Section 9.1, it is possible to present the essence of the situation in a fairly straightforward way, to be followed by a more general and more rigorous treatment later in the section. It is assumed that $f(x)$ and $c_i(x)$ for all $i$ are $C^2$ functions.

Suppose that there are only equality constraints present, and that a local solution $x^*$ exists at which the vectors $a_i^*$, $i \in E$, are independent, so that a unique vector of multipliers $\lambda^*$ exists in (9.1.3). Under these conditions a feasible incremental step $\delta$ can be taken along any feasible direction $s$ at $x^*$. By feasibility and (9.1.6) it follows that $f(x^* + \delta) = \mathcal{L}(x^* + \delta, \lambda^*)$. Also since $\mathcal{L}$ is stationary at $x^*$, $\lambda^*$ (equation (9.1.7)), a Taylor expansion of $\mathcal{L}(x, \lambda^*)$ about $x^*$ enables the second order terms to be isolated. Hence


$$\begin{aligned} f(x^* + \delta) &= \mathcal{L}(x^* + \delta, \lambda^*) \\ &= \mathcal{L}(x^*, \lambda^*) + \delta^T \nabla_x \mathcal{L}(x^*, \lambda^*) + \tfrac{1}{2} \delta^T W^* \delta + o(\delta^T \delta) \\ &= f^* + \tfrac{1}{2} \delta^T W^* \delta + o(\delta^T \delta) \end{aligned} \tag{9.3.1}$$


where $W^* = \nabla_x^2 \mathcal{L}(x^*, \lambda^*) = \nabla^2 f(x^*) - \sum_i \lambda_i^* \nabla^2 c_i(x^*)$ denotes the Hessian matrix with respect to $x$ of the Lagrangian function. It follows by the minimality of $f^*$, and taking the limit in (9.3.1), that


$$s^T W^* s \ge 0. \tag{9.3.2}$$

As in Section 9.1 a feasible direction satisfies

$$a_i^{*T} s = 0, \quad i \in E \tag{9.3.3}$$

which can be written in matrix notation as


$$A^{*T} s = 0. \tag{9.3.4}$$


Thus a second order necessary condition for a local minimizer is that (9.3.2) must hold for any $s$ which satisfies (9.3.4). That is to say, the Lagrangian function must have non-negative curvature for all feasible directions at $x^*$. Of course when no constraints are present then (9.3.2) reduces to the usual condition that $G^*$ is positive semi-definite.

As for unconstrained optimization in Volume 1, Section 2.1, it is also possible to state very similar conditions which are sufficient. A sufficient condition for an isolated local minimizer is that if (9.1.3) holds at any feasible point $x^*$ and if


$$s^T W^* s > 0 \tag{9.3.5}$$


for all $s$ ($\ne 0$) in (9.3.4), then $x^*$ is an isolated local minimizer. The proof of this makes use of the fact that (9.3.5) implies the existence of a constant $a > 0$ for which $s^T W^* s \ge a s^T s$ for all $s$ ($\ne 0$) in (9.3.4). Then for any feasible step $\delta$, (9.3.1) holds, and if $\delta$ is sufficiently small it follows that $f(x^* + \delta) > f^*$. No regularity assumptions are made in this proof.

A simple but effective illustration of these conditions is given by Fiacco and 
McCormick (1968). Consider the problem 


$$\begin{aligned} \text{minimize} \quad & f(x) \triangleq \tfrac{1}{2}\big((x_1 - 1)^2 + x_2^2\big) \\ \text{subject to} \quad & c(x) \triangleq -x_1 + \beta x_2^2 = 0 \end{aligned} \tag{9.3.6}$$


where $\beta$ is fixed, and examine for what values of $\beta$ is $x^* = 0$ a local minimizer. The cases $\beta = \tfrac{1}{4}$ ($x^*$ is a local minimizer) and $\beta = 1$ (not a minimizer) are illustrated in Figure 9.3.1. At $x^*$, $g^* = a^* = (-1, 0)^T$ so the first order conditions (9.1.3) are satisfied with $\lambda^* = 1$, and $x^*$ is also feasible. The set of feasible directions in (9.3.4) is $s = (0, s_2)^T$ for all $s_2 \ne 0$. Now

$$W^* = \begin{pmatrix} 1 & 0 \\ 0 & 1 - 2\beta \end{pmatrix}$$

so $s^T W^* s = s_2^2 (1 - 2\beta)$. Thus the second order necessary conditions are violated when $\beta > \tfrac{1}{2}$, in which case it can be concluded that $x^*$ is not a local minimizer. When $\beta < \tfrac{1}{2}$ then the sufficient conditions (9.3.5) and (9.3.4) are satisfied and it follows that $x^*$ is a local minimizer. Only when $\beta = \tfrac{1}{2}$ is the result not determined by the second order conditions. This corresponds to zero curvature of $\mathcal{L}$ existing along a feasible direction, so that higher order terms become significant.

Figure 9.3.1 Second order conditions (panels: $c(x) = 0$ for $\beta = \tfrac{1}{4}$; $c(x) = 0$ for $\beta = 1$)
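
The curvature test in this example can be sketched in code by restricting $W^*$ to the null space of $A^{*T}$. The helper name `reduced_curvature` is hypothetical, and the null-space basis is taken from a singular value decomposition, which is one convenient choice among several.

```python
import numpy as np

def reduced_curvature(beta, lam=1.0):
    """Curvature s^T W* s along the unit feasible direction at x* = 0
    for problem (9.3.6)."""
    hess_f = np.eye(2)                              # Hessian of f
    hess_c = np.array([[0.0, 0.0],
                       [0.0, 2.0 * beta]])          # Hessian of c
    W = hess_f - lam * hess_c                       # Hessian of the Lagrangian
    a = np.array([-1.0, 0.0])                       # constraint normal a*
    # Rows of vt beyond the rank of a^T span the null space of a^T.
    _, _, vt = np.linalg.svd(a.reshape(1, 2))
    Z = vt[1:].T                                    # basis for {s : a^T s = 0}
    return float((Z.T @ W @ Z)[0, 0])

for beta in (0.25, 0.5, 1.0):
    print(beta, reduced_curvature(beta))
```

A positive value indicates that the sufficient condition (9.3.5) holds, a negative value that the necessary condition (9.3.2) is violated, and zero that the second order conditions are inconclusive.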

An important generalization of these conditions is to allow inequality constraints to be present. Now second order conditions are only operative along feasible stationary directions ($s^T g^* = 0$) and not along ascent directions. If an inequality constraint $c_i(x) \ge 0$ is present, and if its multiplier is $\lambda_i^* > 0$, then feasible directions for which $s^T a_i^* > 0$ are ascent directions (see the discussion regarding equation (9.1.14)). Thus usually the stationary directions will satisfy


$$s^T a_i^* = 0, \quad i \in \mathcal{A}^* \tag{9.3.7}$$


and second order necessary conditions are that (9.3.2) holds for all $s$ in (9.3.7). Another way of looking at this is that if $x^*$ solves (7.1.1) locally, then it must solve


$$\begin{aligned} \text{minimize} \quad & f(x) \\ \text{subject to} \quad & c_i(x) = 0, \quad i \in \mathcal{A}^* \end{aligned} \tag{9.3.8}$$


locally, and these conditions follow from (9.3.2) and (9.3.3). For sufficient conditions, if $x^*$ is feasible, if (9.1.12) and (9.1.13) hold, and if $\lambda_i^* > 0$ $\forall i \in I^*$ (KT conditions with strict complementarity), then (9.3.5) for all $s \ne 0$ in (9.3.7) is a sufficient condition for an isolated local minimizer. Alternatively positive curvature can be assumed on a larger subspace in which the conditions corresponding to $\lambda_i^* = 0$ are excluded. These results are justified below. An illustration of these conditions is also given by problem (9.3.6) if the constraint is changed to read $c(x) \ge 0$. A feasible direction $s$ can then have any $s_1 \le 0$. However because $\lambda^* = 1 > 0$, these directions are uphill unless $s_1 = 0$ and hence stationary feasible directions are given by $s = (0, s_2)^T$ as in the equality constraint case. Thus the same conclusions about $\beta$ can be deduced.

It is not difficult to make the further generalization which includes the possibility that there exists a $\lambda_i^* = 0$, $i \in I^*$, in which case a stationary direction exists for which $s^T a_i^* > 0$. It is also possible to allow for non-unique $\lambda_i^*$ in the necessary conditions. A rigorous derivation of the second order conditions is now set out which includes these features. Given any fixed vector $\lambda^*$, it is possible to define the set of strictly (or strongly) active constraints


$$\mathcal{A}_+^* = \{ i \mid i \in E \ \text{or} \ \lambda_i^* > 0 \} \tag{9.3.9}$$


which is obtained by deleting indices for which $\lambda_i^* = 0$, $i \in I^*$, from $\mathcal{A}^*$. Consider all feasible directional sequences $x^{(k)} \to x^*$ for which


$$c_i(x^{(k)}) = 0 \quad \forall i \in \mathcal{A}_+^* \tag{9.3.10}$$


also holds. Define $\mathcal{G}^*$ as the resulting set of feasible directions. As in Section 9.2, consider also the set of feasible directions

$$G^* = \{ s \mid s \ne 0, \quad a_i^{*T} s = 0, \ i \in \mathcal{A}_+^*, \quad a_i^{*T} s \ge 0, \ i \in \mathcal{A}^* - \mathcal{A}_+^* \} \tag{9.3.11}$$


in which the constraints which determine $\mathcal{G}^*$ (including (9.3.10)) are linearized. By an identical argument to lemma 9.2.1 it follows that $G^* \supseteq \mathcal{G}^*$. However to state the second order necessary conditions a result going the other way is required, so another regularity assumption is made, namely that


$$\mathcal{G}^* = G^*. \tag{9.3.12}$$


Again this is a reasonable assumption to make as it is also implied by the mild conditions of lemma 9.2.2 in a similar way.
It is now possible to state the main results of this section in their full generality. 


Theorem 9.3.1 (Second order necessary conditions) 


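If $x^*$ is a local solution to (7.1.1), and if (9.2.4) holds, then there exist multipliers $\lambda^*$ such that theorem 9.1.1 is valid. For any such $\lambda^*$, if (9.3.12) holds, it follows that


$$s^T W^* s \ge 0 \quad \forall s \in G^*. \tag{9.3.13}$$


Proof


Let $s \in G^*$. Then by (9.3.12), $s \in \mathcal{G}^*$, and there exists a feasible directional sequence with $s^{(k)} \to s$, for which (9.3.10) holds. Since either $c_i^{(k)} = 0$ for $i \in \mathcal{A}_+^*$, or $\lambda_i^* = 0$ otherwise, it follows that $f^{(k)} = \mathcal{L}(x^{(k)}, \lambda^*)$. Using (9.2.1) and (9.1.16), a Taylor series for $\mathcal{L}(x, \lambda^*)$ about $x^*$ gives


$$\begin{aligned} \mathcal{L}(x^{(k)}, \lambda^*) &= \mathcal{L}(x^*, \lambda^*) + \delta^{(k)} s^{(k)T} \nabla_x \mathcal{L}(x^*, \lambda^*) + \tfrac{1}{2} \delta^{(k)2} s^{(k)T} W^* s^{(k)} + o(\delta^{(k)2}) \\ &= f^* + \tfrac{1}{2} \delta^{(k)2} s^{(k)T} W^* s^{(k)} + o(\delta^{(k)2}). \end{aligned} \tag{9.3.14}$$


Since $x^*$ is a local minimizer, it follows for all $k$ sufficiently large that $f^{(k)} \ge f^*$ and hence that


$$s^{(k)T} W^* s^{(k)} + o(1) \ge 0.$$


Then (9.3.13) follows in the limit. □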


Theorem 9.3.2 (Second order sufficient conditions)

If at $x^*$ there exist multipliers $\lambda^*$ such that conditions (9.1.16) hold, and if

$$s^T W^* s > 0 \quad \forall s \in G^*, \tag{9.3.15}$$

then $x^*$ is an isolated local solution to (7.1.1).

Proof


Assume $x^*$ is not an isolated local minimizer, so that there exists a feasible sequence $x^{(k)} \to x^*$ such that $f^{(k)} \le f^*$. Fixing $\|s^{(k)}\| = 1$, say, in (9.2.1) then this bound implies that there exists a subsequence such that $s^{(k)}$ converges to $s$, say. By lemma 9.2.1, $s \in F^*$, and by a similar argument to that in lemma 9.2.3, $s^T g^* \le 0$. Two cases occur, both of which imply a contradiction:


(i) $s \notin G^*$; then there exists $i$ with $\lambda_i^* > 0$ and $a_i^{*T} s > 0$, whence $0 \ge s^T g^* = \sum_i s^T a_i^* \lambda_i^* > 0$.

(ii) $s \in G^*$; by feasibility of $x^{(k)}$, $\mathcal{L}(x^{(k)}, \lambda^*) \le f^{(k)}$, so from (9.3.14) it follows that $0 \ge f^{(k)} - f^* \ge \tfrac{1}{2} \delta^{(k)2} s^{(k)T} W^* s^{(k)} + o(\delta^{(k)2})$, and dividing by $\delta^{(k)2}$ and taking the limit contradicts (9.3.15). □


Notice that a sufficient condition for (9.3.15) is that $s^T W^* s > 0$ $\forall s \ne 0$ such that $s^T a_i^* = 0$, $i \in \mathcal{A}_+^*$, which is more convenient to verify in practice.

This treatment of second order conditions owes a lot to the presentation given by Fiacco and McCormick (1968). However it is worth pointing out that the statement of the second order necessary conditions given here is an improvement. Fiacco and McCormick define feasible directions from arcs which satisfy the conditions $c_i(x) = 0$, $i \in \mathcal{A}^*$, rather than using $\mathcal{A}_+^*$ as in (9.3.10). Although this has the advantage of not involving $\lambda^*$ and hence $f$, it neglects the stronger implications which can be made when $\lambda_i^* = 0$ in regular situations. Furthermore in a degenerate case it is extremely unlikely that any feasible arc will exist at all, so that the regularity assumption which they make (the second order constraint qualification) cannot usually hold. Both these objections are overcome when $\mathcal{A}_+^*$ is used. The situation is shown by the example


$$\begin{aligned} \text{minimize} \quad & f(x) \triangleq x_3 - \tfrac{1}{2} x_1^2 \\ \text{subject to} \quad & c_1(x) \triangleq x_3 + x_2 + x_1^2 \ge 0 \\ & c_2(x) \triangleq x_3 - x_2 + x_1^2 \ge 0 \\ & c_3(x) \triangleq x_3 \ge 0 \end{aligned} \tag{9.3.16}$$


illustrated in Figure 9.3.2. Clearly $x^* = 0$ is not a minimizing point. Yet $F^* = \mathcal{F}^*$ and $g^* = (0, 0, 1)^T$, $a_1^* = (0, 1, 1)^T$, $a_2^* = (0, -1, 1)^T$, $a_3^* = (0, 0, 1)^T$. All constraints are active and the first order conditions at $x^*$ are satisfied non-uniquely by any vector $\lambda^*$ which is a convex combination of $(\tfrac{1}{2}, \tfrac{1}{2}, 0)^T$ and $(0, 0, 1)^T$. Since, however, there is no point other than $x^*$ which satisfies $c_i(x) = 0$ for all $i \in \mathcal{A}^*$, Fiacco and McCormick's second order constraint qualification does not hold, and no implication can be made. However consider theorem 9.3.1 using the extreme multiplier vectors. In both cases the set $G^*$ comprises any vector $s$ with $s_1 \ne 0$ and $s_2 = s_3 = 0$. For $\lambda^* = (\tfrac{1}{2}, \tfrac{1}{2}, 0)^T$ there is an arc (OC in Figure 9.3.2) which satisfies $c_i(x) = 0$ for all $i \in \mathcal{A}_+^*$, but which is not feasible in $c_i(x) \ge 0$, $i \in \mathcal{A}^* - \mathcal{A}_+^*$. Hence (9.3.12) does not hold for this arc and so no implication can be made. However for $\lambda^* = (0, 0, 1)^T$, either of the arcs OA or OB in Figure 9.3.2, or any intermediate arc with $x_3 = 0$, provides a suitable arc when $s_1 > 0$ and an opposite arc can be used when $s_1 < 0$. Then since $W_{11}^* = -1$, (9.3.13) does not hold, so it can be concluded correctly that $x^*$ is not a local solution to problem (9.3.16).

Figure 9.3.2 Regularity assumptions for second order necessary conditions
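
The multiplier vectors and the curvature quoted for this example can be checked directly. The sketch below assumes problem (9.3.16) has the form implied by the gradients given in the text; the helper `lagrangian_hessian` is hypothetical.

```python
import numpy as np

# Columns are the constraint normals a1*, a2*, a3* at x* = 0.
A = np.array([[0.0,  0.0, 0.0],
              [1.0, -1.0, 0.0],
              [1.0,  1.0, 1.0]])
g = np.array([0.0, 0.0, 1.0])               # gradient of f at x* = 0

def lagrangian_hessian(lam):
    """W* = hess f - sum_i lam_i hess c_i at x* = 0."""
    hess_f = np.diag([-1.0, 0.0, 0.0])
    hess_c12 = np.diag([2.0, 0.0, 0.0])     # c1 and c2 share this Hessian; c3 is linear
    return hess_f - (lam[0] + lam[1]) * hess_c12

s = np.array([1.0, 0.0, 0.0])               # direction in G* (s1 != 0, s2 = s3 = 0)
extremes = [np.array([0.5, 0.5, 0.0]), np.array([0.0, 0.0, 1.0])]

first_order_ok = [bool(np.allclose(A @ lam, g)) for lam in extremes]
curvatures = [float(s @ lagrangian_hessian(lam) @ s) for lam in extremes]
print(first_order_ok, curvatures)
```

Both extreme multiplier vectors satisfy the first order conditions, and the curvature along $s$ is negative in each case; only for $\lambda^* = (0, 0, 1)^T$ does the regularity assumption (9.3.12) allow theorem 9.3.1 to be applied.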


9.4 Convexity 


The subject of convexity is often treated quite extensively in texts on optimization. 
My experience, however, is that much of this theory contributes little to the 
development and use of optimization algorithms. Applications of convexity are 
expressed in terms of a so-called convex programming problem. Into this category 
come linear programming, certain quadratic programming problems, and some more 
general problems, more often with linear constraints. Unfortunately many real life 
problems do not fit into this category, and this is especially so when the constraint 
functions are non-linear. On the other hand, it is possible to give quite strong (and 
simple) results for a convex programming problem about the global nature of 
solutions and the sufficiency of first order conditions. Therefore a fairly simple 
treatment of convexity is given in this section, aimed mainly at establishing these 
results for smooth problems. Some extensions of convexity theory which are helpful for handling non-smooth problems are given in Section 14.2.

First of all, a convex set $K$ in $\mathbb{R}^n$ is defined by the property that for all $x_0, x_1 \in K$, it follows that $x_\theta \in K$ where


$$x_\theta = (1 - \theta) x_0 + \theta x_1 \quad \forall \theta \in [0, 1]. \tag{9.4.1}$$


It follows from this that $K$ can have no re-entrant corners (see Figure 9.4.1). A more general definition of a convex set which readily follows is that for all $x_0, x_1, \ldots, x_m \in K$ it follows that $x_\theta \in K$ where


$$x_\theta = \sum_{i=0}^{m} \theta_i x_i, \qquad \sum_{i=0}^{m} \theta_i = 1, \qquad \theta_i \ge 0. \tag{9.4.2}$$


The vector $x_\theta$ in (9.4.1) or (9.4.2) is referred to as a convex combination of the points $x_0$, $x_1$, etc. If $x_0, x_1, \ldots, x_m$ is a given set of points, then the set of all vectors $x_\theta$ defined by (9.4.2) is a convex set referred to as the convex hull of the set of points. Examples of convex sets are many and include the empty set, a point, the whole of $\mathbb{R}^n$, a line or line segment, a hyperplane (or linear equation) $a^T x = b$, the half-space (or linear inequality) $a^T x \ge b$, the ball $\|x - x'\|_2 \le h$, a cone, and many others. A simple result is the following.

Figure 9.4.1 Convex sets


Lemma 9.4.1 

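If $K_i$, $i = 1, 2, \ldots, m$, are convex sets then the intersection $K = \bigcap_{i=1}^{m} K_i$ is also a
convex set.

Proof

Let $x_0, x_1 \in K$. Then $x_0, x_1 \in K_i$ for $i = 1, 2, \ldots, m$, so by convexity of each
$K_i$, $x_\theta \in K_i$ for all $i$ and hence $x_\theta \in K$. □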


This result is also illustrated in Figure 9.4.1. It shows that the feasible region in 
a linear or quadratic programming problem is a convex set because it is the inter- 
section of hyperplanes and half-spaces. 

Another useful concept is that of an extreme point of a convex set $K$. An extreme
point $x$ is one which may not lie interior to any line segment contained in $K$,
that is $x = (1-\theta)x_0 + \theta x_1$ for $x_0, x_1 \in K$, $\theta \in (0, 1)$ implies that $x = x_0 = x_1$. The
vertices of a regular polygon or any point on the circumference of a circle are
examples of extreme points (Figure 9.4.1). Another example is the basic feasible
solution $x$ of the convex set $K$ defined by the feasible region $R$ of a linear programming
problem in standard form (8.1.1). Details of the relationship between an
extreme point and a b.f.s. are sketched out in Questions 9.20, 9.21, and 9.22.

The other fundamental idea is that of a convex function. The discussion is
limited to continuous functions defined on a convex set $K$, to eliminate trivial
cases. Then a convex function $f(x)$ is defined by the condition that for any
$x_0, x_1 \in K$ it follows that

$$f_\theta \le (1-\theta)f_0 + \theta f_1 \quad \forall\, \theta \in [0, 1] \tag{9.4.3}$$


where $f_\theta$ refers to $f(x_\theta)$, etc., and where $x_\theta$ is given by (9.4.1). The right hand side
of (9.4.3) is the chord joining $(x_0, f_0)$ to $(x_1, f_1)$ on the graph of $f(x)$ (see Figure
9.4.2), and the inequality expresses the fact that the graph of a convex function
always lies below (or along) the chord. If $K$ is an open set and $f(x)$ is differentiable
($C^1$) on $K$, an equivalent definition of convexity is that for all $x_0, x_1 \in K$ it follows
that

$$f_1 \ge f_0 + (x_1 - x_0)^T \nabla f_0. \tag{9.4.4}$$

Figure 9.4.2 Convex functions


This definition shows that the graph of $f(x)$ must lie above (or along) the linearization
of $f(x)$ about $x_0$, and hence that this linearization acts as a supporting hyperplane
for the convex function. The equivalence of (9.4.3) and (9.4.4) is readily
demonstrated. It follows from (9.4.3) that

$$(f_\theta - f_0)/\theta \le f_1 - f_0.$$

But regarding $x_\theta$ as a point on the line $x_\theta = x_0 + \theta(x_1 - x_0)$, and taking the limit
$\theta \downarrow 0$, then (9.4.4) follows. Conversely if (9.4.4) holds then expanding about $x_\theta$,

$$f_1 \ge f_\theta + (x_1 - x_\theta)^T \nabla f_\theta$$
$$f_0 \ge f_\theta + (x_0 - x_\theta)^T \nabla f_\theta$$

so that

$$(1-\theta)f_0 + \theta f_1 \ge f_\theta + \big((1-\theta)(x_0 - x_\theta) + \theta(x_1 - x_\theta)\big)^T \nabla f_\theta = f_\theta$$

which is (9.4.3).

Another result which follows from (9.4.4) is that

$$(x_1 - x_0)^T \nabla f_1 \ge f_1 - f_0 \ge (x_1 - x_0)^T \nabla f_0. \tag{9.4.5}$$


This illustrates the fact that the slope of a convex function is non-decreasing along
any line. In fact this result (for the directional derivative) can also be proved to
hold for a non-differentiable convex function. Finally for twice differentiable ($C^2$)
convex functions and $K$ open, another equivalent definition of a convex function is
that

$$\nabla^2 f_0 \text{ is positive semi-definite } \forall\, x_0 \in K. \tag{9.4.6}$$


Thus convex functions are typified by having non-negative curvature. To establish
this result, let $s \ne 0$ and let $x_1 = x_0 + \alpha s$. Then a Taylor series for $\nabla f_1$ gives

$$\nabla f_1 = \nabla f_0 + \alpha \nabla^2 f_0\, s + o(\alpha). \tag{9.4.7}$$

Substituting in (9.4.5) and taking the limit $\alpha \downarrow 0$ gives $s^T \nabla^2 f_0\, s \ge 0$, which is
(9.4.6). Conversely, a Taylor series for $f_1$ (with $\theta \in [0, 1]$) and (9.4.6) yields

$$f_1 = f_0 + (x_1 - x_0)^T \nabla f_0 + \tfrac{1}{2}\alpha^2 s^T \nabla^2 f_\theta\, s \ge f_0 + (x_1 - x_0)^T \nabla f_0,$$

which is (9.4.4).
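The three characterizations (9.4.3), (9.4.4) and (9.4.6) can be compared numerically on a small example. The following is a minimal sketch, assuming a quadratic $f(x) = \tfrac{1}{2}x^T G x$ in two variables; the matrices, the sample grid and the function names are illustrative choices and are not taken from the text. Passing the tests on a finite grid only suggests convexity, whereas a single failure disproves it.

```python
# Sketch: test the chord inequality (9.4.3) and the gradient inequality (9.4.4)
# for f(x) = 0.5 * x^T G x on a small grid of points in R^2, and compare with
# the Hessian test (9.4.6). All data below are illustrative assumptions.

def quad(G, x):
    """Return f(x) = 0.5 * x^T G x for a symmetric 2x2 matrix G."""
    return 0.5 * (G[0][0] * x[0] ** 2 + 2.0 * G[0][1] * x[0] * x[1]
                  + G[1][1] * x[1] ** 2)

def grad(G, x):
    """Return the gradient G x."""
    return (G[0][0] * x[0] + G[0][1] * x[1], G[1][0] * x[0] + G[1][1] * x[1])

def is_psd_2x2(G):
    """A symmetric 2x2 matrix is positive semi-definite iff its diagonal
    entries and its determinant are all non-negative."""
    det = G[0][0] * G[1][1] - G[0][1] * G[1][0]
    return G[0][0] >= 0.0 and G[1][1] >= 0.0 and det >= 0.0

def satisfies_convexity_tests(G, points, thetas, tol=1e-12):
    """Check (9.4.3) and (9.4.4) for every ordered pair of sample points."""
    for x0 in points:
        for x1 in points:
            f0, f1 = quad(G, x0), quad(G, x1)
            g0 = grad(G, x0)
            slope = (x1[0] - x0[0]) * g0[0] + (x1[1] - x0[1]) * g0[1]
            if f1 < f0 + slope - tol:          # gradient inequality (9.4.4)
                return False
            for th in thetas:
                xt = ((1 - th) * x0[0] + th * x1[0],
                      (1 - th) * x0[1] + th * x1[1])
                if quad(G, xt) > (1 - th) * f0 + th * f1 + tol:  # chord (9.4.3)
                    return False
    return True

points = [(i, j) for i in (-1.0, 0.0, 1.0) for j in (-1.0, 0.0, 1.0)]
thetas = [0.25, 0.5, 0.75]
G_convex = [[2.0, 1.0], [1.0, 2.0]]      # eigenvalues 1 and 3
G_indefinite = [[1.0, 2.0], [2.0, 1.0]]  # eigenvalues -1 and 3

print(is_psd_2x2(G_convex), satisfies_convexity_tests(G_convex, points, thetas))
print(is_psd_2x2(G_indefinite),
      satisfies_convexity_tests(G_indefinite, points, thetas))
```

For the indefinite matrix the pair $x_0 = (1, -1)^T$, $x_1 = (-1, 1)^T$ already violates the chord inequality, since $f_0 = f_1 = -1$ but $f = 0$ at the midpoint.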

Other definitions which are closely related to that of a convex function are the
following. A strictly convex function is defined whenever the inequality in (9.4.3)
is strict for all distinct $x_0, x_1$ and $\theta \in (0, 1)$. For $C^1$ functions (9.4.4) again provides
an equivalent definition when the inequality is strict and $x_0 \ne x_1$. However
for $C^2$ functions, (9.4.6) is not equivalent, since although $\nabla^2 f_0$ positive definite
$\forall\, x_0 \in K$ implies that $f(x)$ is strictly convex, the converse result does not hold
(for example $x^4$ is strictly convex but has zero Hessian at $x = 0$). A concave function
$f(x)$ is defined as one for which $-f(x)$ is convex, and so is associated with non-increasing
slope or non-positive curvature. Likewise a strictly concave function $f(x)$
has $-f(x)$ strictly convex.

Examples of convex functions include the linear function, which is both convex
and concave. A quadratic function is convex when the Hessian is positive semi-definite
and strictly convex when the Hessian is positive definite. Another convex
function is $\|x\|$ (for any norm). However $\|c(x)\|$, where $c(x)$ maps $\mathbb{R}^n$ to $\mathbb{R}^m$, is
not generally convex, except when $c(x)$ is a linear function. A transformation which
preserves convexity is expressed in the next result.


Lemma 9.4.2 

If $f_i(x)$, $i = 1, 2, \ldots, m$, are convex functions on a convex set $K$, and if $\lambda_i \ge 0$, then
$\sum_i \lambda_i f_i(x)$ is a convex function on $K$.

Proof

Take $x_\theta$ as in (9.4.1) and use the definition of a convex function. □





The problem of minimizing a convex function on a convex set $K$ is said to be a
convex programming problem. Such a problem arises when (7.1.1) can be expressed
as

$$\begin{array}{ll} \text{minimize} & f(x) \\ \text{subject to} & x \in K \triangleq \{x \mid c_i(x) \ge 0,\ i = 1, 2, \ldots, m\} \end{array} \tag{9.4.8}$$

where $f(x)$ is a convex function on $K$, and the functions $c_i(x)$, $i = 1, 2, \ldots, m$, are
concave on $\mathbb{R}^n$. That the feasible region in (9.4.8) is a convex set is a consequence
of the following lemma and of lemma 9.4.1.

Lemma 9.4.3

If $c(x)$ is a concave function then the set

$$S(k) = \{x \mid c(x) \ge k\}$$

is convex.

Proof

For $x_0, x_1 \in S(k)$, and if $x_\theta$ is given by (9.4.1), it follows by concavity that

$$c_\theta \ge (1-\theta)c_0 + \theta c_1 \ge (1-\theta)k + \theta k = k$$

by definition of $S(k)$. Thus $x_\theta \in S(k)$ which is therefore convex. □


Notice that the system (9.4.8) does not allow general equality constraints,
although it is possible to include any linear equality $c(x) = 0$ as the intersection of
$c(x) \ge 0$ and $-c(x) \ge 0$. An example of a convex programming problem is therefore
the linear programming problem (linear objective function, linear equality and
inequality constraints). Another example is the quadratic programming problem 
(quadratic objective function, linear constraints) when the Hessian of the quadratic 
function is positive semi-definite. However it should be noticed that quadratic 
programming problems can (and do) exist which have well-behaved local (or even 
unique global) solutions, yet for which the Hessian is indefinite. It is then erroneous 
to assume that some of the consequences of convexity (see next section) will apply 
in this case. 

One main attraction of convexity is that it provides an overall assumption where- 
by the existence of local but not global solutions can be excluded, as the next 
theorem shows. An additional assumption, given in the corollary, enables unique- 
ness of global solutions to be established. 


Theorem 9.4.1 


Every local solution x* to a convex programming problem (9.4.8) is a global 
solution, and the set of global solutions S is convex. 


Proof 


Let $x^*$ be a local but not global solution. Then $\exists\, x_1 \in K$ such that $f_1 < f^*$. For
$\theta \in [0, 1]$, consider $x_\theta = (1-\theta)x^* + \theta x_1 \in K$ by convexity of $K$. By convexity of
$f$, $f_\theta \le (1-\theta)f^* + \theta f_1 = f^* + \theta(f_1 - f^*) < f^*$ for $\theta > 0$. In the limit $\theta \downarrow 0$ the local
solution property is contradicted. Thus local solutions are global. Now let $x_0, x_1 \in S$
and define $x_\theta$ by (9.4.1). By the global solution property, $f_\theta \ge f_0 = f_1$. By convexity
$f_\theta \le (1-\theta)f_0 + \theta f_1 = f_0 = f_1$. Thus $f_\theta = f_0 = f_1$ and so $x_\theta \in S$, so $S$ is
convex. □

Corollary

If also $f(x)$ is strictly convex on $K$ then any global solution is unique.

Proof

Let $x_0 \ne x_1 \in S$ and $\theta \in (0, 1)$. As above, but using strict convexity, both
$f_\theta \ge f_0 = f_1$ and $f_\theta < f_0 = f_1$ which is a contradiction. □


A second attraction of a convexity assumption is that it provides a framework 
within which the first order (Kuhn—Tucker) conditions are sufficient for a local 
solution, as the next theorem shows. In common with other sufficient conditions 
(theorem 9.3.2), no regularity assumption is required. 


Theorem 9.4.2 


In the convex programming problem (9.4.8), if $f(x)$ and $c_i(x)$, $i = 1, 2, \ldots, m$, are
$C^1$ functions on $K$ and if conditions (9.1.16) hold at $x^*$, then $x^*$ is a global solution
to (9.4.8).

Proof

Let $x' \in K$, $x' \ne x^*$. Then since $\lambda_i^* \ge 0$ and $c_i' \ge 0$,

$$f' \ge f' - \sum_{i=1}^{m} \lambda_i^* c_i' \ge f^* + (x' - x^*)^T g^* - \sum_{i=1}^{m} \lambda_i^* \big(c_i^* + (x' - x^*)^T a_i^*\big),$$

using (9.4.4), since $f$ is convex and the $c_i$ are concave. Then from (9.1.16), $\lambda_i^* c_i^* = 0$
and $g^* = \sum_i a_i^* \lambda_i^*$ show that $f' \ge f^*$ and hence $x^*$ is a global solution. □


To summarize, for x* to solve the convex programming problem (9.4.8), con- 
ditions (9.1.16) and a regularity assumption (9.2.4) are necessary, whilst (9.1.16) 
alone are sufficient. Of course (9.2.4) is implied by the constraint qualification
$F^* = \mathcal{F}^*$ which in turn is implied by the assumptions of lemma 9.2.2, as before.
It should be emphasized that it is not possible to dispense with the regularity 
assumption (9.2.4). An example of a convex feasible region in which $F^* \ne \mathcal{F}^*$
(in both $\mathbb{R}^2$ and $\mathbb{R}^3$) is given by the inequalities $x_2 \ge x_1^2$ and $x_2 \le 0$ at $x^* = 0$.
An illustration of a (regular) convex programming problem is provided by problem
(9.1.15). The objective function is linear and hence convex. The Hessian matrices
$\nabla^2 c_1$ and $\nabla^2 c_2$ are negative semi-definite so the constraint
functions are concave. It can be seen from Figure 7.1.3 that the feasible
region is a convex set. Since first order conditions hold at $x^* = (1/\sqrt{2}, 1/\sqrt{2})^T$ it
follows from theorem 9.4.2 that $x^*$ is a global solution to the problem.
It can be seen from the above theorems that a convexity assumption is in the
nature of a curvature or second order assumption, in that the directional derivative
is non-decreasing along any line. Therefore, although convexity gives useful results
for certain special types of problem, it is an assumption which does not often hold
in the general case. A weakening of the assumptions is to require convexity of
$f(x)$ and $-c_i(x)$, $i = 1, 2, \ldots, m$, on a ball about $x^*$. In this case theorem 9.4.2
can be interpreted as stating that local convexity and (9.1.16) is sufficient for a
local solution. Even in this form, however, the assumption is not valid for many
problems. The requirement is essentially that the matrices $\nabla^2 f^*$ and $-\nabla^2 c_i^*$,
$i = 1, 2, \ldots, m$, are positive semi-definite. The second order conditions of theorem
9.3.2 involve the much weaker assumption (9.3.14) that $W^* = \nabla^2 f^* - \sum_i \lambda_i^* \nabla^2 c_i^*$
is positive definite only on a restricted subspace, and it is only rarely that this condition
does not hold at a local solution. Thus convexity does not provide a valid
model situation for the general non-linear programming problem from which to
derive an algorithm, whereas the second order conditions can do this (see Section
1s): 


9.5 Duality 


The concept of duality occurs widely in the mathematical programming literature.
The aim is to provide an alternative formulation of a mathematical programming
problem which is more convenient computationally or has some theoretical significance.
The original problem is referred to as the primal and the transformed problem
as the dual. Usually the variables in the dual (or some of them) can be interpreted
as Lagrange multipliers and take the value $\lambda^*$ at the dual solution, where
$\lambda^*$ is a multiplier vector associated with a primal solution $x^*$. In this sense the
method of Lagrange multipliers (Section 9.1) might be thought of as a dual method.
Usually, however, there is also present an objective function (often related to the
Lagrangian function (9.1.6)) which has to be optimized. Duality theory of this kind
is associated with a convex programming problem as the primal, and it is important
to realize that if the primal is not convex then the dual problem may well not have
a solution from which the primal solution can be derived (see Question 9.23). Thus
it is not valid to apply the duality transformation as a general purpose solution
technique.

In this book the emphasis is on duality transformations which are convenient 
computationally, and it seems that these can largely be deduced as a consequence 
of one particular form known as the Wolfe dual (Wolfe, 1961). This is a very simple 
result, closely related to the first order conditions (9.1.16), but which replaces 
constraint conditions by an optimality requirement on the Lagrangian function. 


Theorem 9.5.1 


If $x^*$ solves the convex programming primal problem (9.4.8), if $f$ and $c_i$, $i = 1, 2,
\ldots, m$, are $C^1$ functions, and if the regularity assumption (9.2.4) holds, then
$x^*, \lambda^*$ solves the dual problem

$$\begin{array}{ll} \underset{x,\lambda}{\text{maximize}} & \mathcal{L}(x, \lambda) \\ \text{subject to} & \nabla_x \mathcal{L}(x, \lambda) = 0, \quad \lambda \ge 0. \end{array} \tag{9.5.1}$$

Furthermore the minimum primal and maximum dual function values are equal,
that is $f^* = \mathcal{L}(x^*, \lambda^*)$.


Proof 


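The conditions of the theorem are those of theorem 9.1.1 so it follows that multipliers
$\lambda^* \ge 0$ exist such that $\nabla_x \mathcal{L}(x^*, \lambda^*) = 0$ (dual feasibility), and $\lambda_i^* c_i^* = 0$,
$i = 1, 2, \ldots, m$, from which $f^* = \mathcal{L}(x^*, \lambda^*)$ follows. Now let $x, \lambda$ be dual feasible.
Then using $\lambda \ge 0$ (together with $c_i^* \ge 0$), convexity of $\mathcal{L}$ (lemma 9.4.2), and
$\nabla_x \mathcal{L} = 0$ in turn,

$$\begin{aligned} \mathcal{L}(x^*, \lambda^*) = f^* &\ge f^* - \sum_i \lambda_i c_i^* = \mathcal{L}(x^*, \lambda) \\ &\ge \mathcal{L}(x, \lambda) + (x^* - x)^T \nabla_x \mathcal{L}(x, \lambda) \\ &= \mathcal{L}(x, \lambda). \end{aligned}$$

Hence $\mathcal{L}(x^*, \lambda^*) \ge \mathcal{L}(x, \lambda)$ and so $x^*, \lambda^*$ solves the dual. □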


An apparent disadvantage of the Wolfe dual is that the symmetry in which the 
primal is the dual of the dual is not generally present, since the dual may not even 
be a convex programming problem. However this does not affect the computational 
value of the Wolfe dual, and in fact for some of the transformed problems which 
can be deduced directly from the Wolfe dual, this symmetry does hold. It is also 
important to consider what happens in the dual if the primal has no solution; this 
point is taken up at the end of this section. 

The first example of the dual transformation is in linear programming (LP). The
primal problem

$$\begin{array}{ll} \underset{x}{\text{minimize}} & f_0 + c^T x \\ \text{subject to} & A^T x \ge b \end{array} \tag{9.5.2}$$

is not in standard form. Section 7.2 describes how to obtain a standard form by
including both slack variables and also non-negative variables $x^+$ and $x^-$ (since the
bounds $x \ge 0$ are not present). However if there are many more inequalities than
variables ($m \gg n$) it is much more attractive to disregard this transformation and
instead to use the Wolfe dual. This is valid because the linear functions in (9.5.2)
imply both that (9.5.2) is a convex programming problem, and that the regularity
assumption (9.2.4) is true. The dual problem (9.5.1) becomes

$$\begin{array}{ll} \underset{x,\lambda}{\text{maximize}} & f_0 + c^T x - \lambda^T(A^T x - b) \\ \text{subject to} & c - A\lambda = 0, \quad \lambda \ge 0. \end{array}$$

On substituting for $c$ in the objective function, the problem becomes independent
of $x$, giving

$$\begin{array}{ll} \underset{\lambda}{\text{maximize}} & f_0 + b^T \lambda \\ \text{subject to} & A\lambda = c, \quad \lambda \ge 0 \end{array} \tag{9.5.3}$$

which is an LP problem in standard form. Once this problem has been solved for
$\lambda^*$, the variables $\lambda_i^*$, $i \in B^*$, will have $\lambda_i^* > 0$ (ignoring the possibility of degeneracy)
which implies that $c_i^* = 0$ from (9.1.16). Thus the solution of the square
system of equations $a_i^T x = b_i$, $i \in B^*$, determines the vector $x^*$ which minimizes
the primal.
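The recovery of $x^*$ from the dual (9.5.3) can be illustrated on a tiny problem. The following is a sketch only, assuming $n = 2$ variables, $m = 3$ constraints and $f_0 = 0$; the data and the function names are hypothetical. Enumerating all bases of the dual stands in here for a proper simplex method and is viable only for very small problems. The sketch also assumes that the dual has a solution and that the optimal basis is non-degenerate.

```python
from itertools import combinations

def solve2(M, r):
    """Solve the 2x2 system M z = r by Cramer's rule; None if M is singular."""
    det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
    if abs(det) < 1e-12:
        return None
    z0 = (r[0] * M[1][1] - M[0][1] * r[1]) / det
    z1 = (M[0][0] * r[1] - r[0] * M[1][0]) / det
    return [z0, z1]

def solve_lp_via_dual(a, b, c):
    """Minimize c^T x subject to a_i^T x >= b_i by solving the dual (9.5.3).

    a is the list of columns a_i of A. Every pair of columns is tried as a
    basis of A lam = c; the feasible basis with the largest b^T lam is kept.
    """
    best = None
    for i, j in combinations(range(len(a)), 2):
        basis = [[a[i][0], a[j][0]], [a[i][1], a[j][1]]]   # columns a_i, a_j
        lam_B = solve2(basis, c)
        if lam_B is None or min(lam_B) < -1e-12:
            continue                                      # singular or lam < 0
        value = b[i] * lam_B[0] + b[j] * lam_B[1]
        if best is None or value > best[0]:
            best = (value, (i, j), lam_B)
    value, (i, j), lam_B = best
    lam = [0.0] * len(a)
    lam[i], lam[j] = lam_B
    # Constraints with positive multipliers are active: a_i^T x = b_i, i in B*.
    x = solve2([a[i], a[j]], [b[i], b[j]])
    return value, lam, x

# Illustrative data: minimize x1 + x2 s.t. x1 >= 1, x2 >= 2, x1 + x2 >= 2.
a = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
b = [1.0, 2.0, 2.0]
c = [1.0, 1.0]

value, lam, x = solve_lp_via_dual(a, b, c)
print(value, lam, x)
```

Here the optimal dual basis consists of the first two constraints, so the primal solution is obtained from $x_1 = 1$, $x_2 = 2$, and the primal value $c^T x^*$ agrees with the dual value $b^T \lambda^*$.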
Another example in linear programming arises by considering the primal problem

$$\begin{array}{ll} \underset{x}{\text{minimize}} & f_0 + c^T x \\ \text{subject to} & A^T x \ge b, \quad x \ge 0. \end{array} \tag{9.5.4}$$

Introducing multipliers $\lambda$ and $\pi$ respectively, then the Wolfe dual (9.5.1) becomes

$$\begin{array}{ll} \underset{x,\lambda,\pi}{\text{maximize}} & f_0 + c^T x - \lambda^T(A^T x - b) - \pi^T x \\ \text{subject to} & c - A\lambda - \pi = 0, \quad \lambda \ge 0, \quad \pi \ge 0. \end{array}$$

Substituting for $c$ in the objective eliminates both the $x$ and $\pi$ variables to give the
problem

$$\begin{array}{ll} \underset{\lambda}{\text{maximize}} & f_0 + b^T \lambda \\ \text{subject to} & A\lambda \le c, \quad \lambda \ge 0. \end{array} \tag{9.5.5}$$

Since this problem is like (9.5.4) in having both inequality constraints and bounds
it is often referred to as the symmetric dual. Its use is advantageous when $A$ has
many fewer rows than columns. Then (after adding slack variables) the standard
form arising from (9.5.5) is much smaller than that arising from (9.5.4).
An extension of problem (9.5.4) is to include equality constraints as

$$\begin{array}{ll} \underset{x}{\text{minimize}} & f_0 + c^T x \\ \text{subject to} & A_1^T x \ge b_1, \\ & A_2^T x = b_2, \quad x \ge 0. \end{array} \tag{9.5.6}$$

This problem can be reduced to (9.5.4) by writing the equality constraints as
$A_2^T x \ge b_2$ and $-A_2^T x \ge -b_2$. Introducing non-negative multipliers $\lambda_1$, $\lambda_2^+$, $\lambda_2^-$,
and $\pi$, and defining $\lambda_2 = \lambda_2^+ - \lambda_2^-$, $\lambda = \begin{pmatrix} \lambda_1 \\ \lambda_2 \end{pmatrix}$, $b = \begin{pmatrix} b_1 \\ b_2 \end{pmatrix}$ and $A = [A_1 : A_2]$,
then from (9.5.5) the dual can be written

$$\begin{array}{ll} \underset{\lambda}{\text{maximize}} & f_0 + b^T \lambda \\ \text{subject to} & A\lambda \le c, \quad \lambda_1 \ge 0. \end{array} \tag{9.5.7}$$

In general the strengthening of a primal linear inequality constraint to become an
equality just causes the bound on the multiplier in the dual to be relaxed. It is
readily observed that all these duals arising from linear programming have the symmetry
property that the dual of the dual simplifies to give the primal problem.
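Another useful application of the dual is to solve the primal quadratic programming
(QP) problem

$$\begin{array}{ll} \underset{x}{\text{minimize}} & \tfrac{1}{2} x^T G x + g^T x \\ \text{subject to} & A^T x \ge b \end{array} \tag{9.5.8}$$

in which $G$ is positive definite. The assumptions of theorem 9.5.1 are again clearly
satisfied, so from (9.5.1) the Wolfe dual is

$$\begin{array}{ll} \underset{x,\lambda}{\text{maximize}} & \tfrac{1}{2} x^T G x + g^T x - \lambda^T(A^T x - b) \\ \text{subject to} & Gx + g - A\lambda = 0, \quad \lambda \ge 0. \end{array}$$

Using the constraints of this dual to eliminate $x$ from the objective function gives
the problem

$$\begin{array}{ll} \underset{\lambda}{\text{maximize}} & -\tfrac{1}{2} \lambda^T (A^T G^{-1} A)\lambda + \lambda^T (b + A^T G^{-1} g) - \tfrac{1}{2} g^T G^{-1} g \\ \text{subject to} & \lambda \ge 0. \end{array} \tag{9.5.9}$$
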

This is again a quadratic programming problem in the multipliers $\lambda$, but subject
only to the bounds $\lambda \ge 0$. As such this can make the problem easier to solve. Once
the solution $\lambda^*$ to (9.5.9) has been found then $x^*$ is obtained by solving the
equations $Gx = A\lambda^* - g$ used to eliminate $x$. In a similar way to (9.5.6), the addition
of equality constraints into (9.5.8) causes no significant difficulty.
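The bound-constrained dual (9.5.9) lends itself to very simple iterations. The following is a minimal sketch, assuming $G = I$ so that no matrix inversion is needed, and using projected coordinate ascent (maximize over one multiplier at a time, then clip at zero). This iteration is an illustrative stand-in chosen for brevity and is not a method described in the text; the data and function names are hypothetical, and every constraint normal $a_i$ is assumed to be non-zero.

```python
def dot(u, v):
    """Inner product of two vectors given as lists."""
    return sum(ui * vi for ui, vi in zip(u, v))

def solve_qp_via_dual(a, b, g, sweeps=200):
    """Minimize 0.5*x^T x + g^T x subject to a_i^T x >= b_i (G = I assumed).

    a is the list of constraint normals a_i (the columns of A). The dual
    (9.5.9) is then: maximize -0.5*lam^T H lam + lam^T d subject to lam >= 0,
    with H = A^T A and d = b + A^T g.
    """
    m, n = len(a), len(g)
    H = [[dot(a[i], a[j]) for j in range(m)] for i in range(m)]
    d = [b[i] + dot(a[i], g) for i in range(m)]
    lam = [0.0] * m
    for _ in range(sweeps):
        for i in range(m):
            # Exact maximization over lam_i alone, projected onto lam_i >= 0.
            r = d[i] - sum(H[i][j] * lam[j] for j in range(m) if j != i)
            lam[i] = max(0.0, r / H[i][i])
    # Recover x from G x = A lam - g with G = I.
    x = [sum(a[i][k] * lam[i] for i in range(m)) - g[k] for k in range(n)]
    return lam, x

# Illustrative data: minimize 0.5*(x1^2 + x2^2) - x1
# subject to x1 + x2 >= 2 and x1 >= 0.
a = [[1.0, 1.0], [1.0, 0.0]]
b = [2.0, 0.0]
g = [-1.0, 0.0]

lam, x = solve_qp_via_dual(a, b, g)
print(lam, x)
```

In this example only the first constraint is active at the solution, so the second multiplier is held at its bound of zero and $x^*$ follows directly from $x = A\lambda^* - g$.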

An example of duality for a non-quadratic objective function is in the solution
of maximum entropy problems in information theory (Eriksson, 1980). The primal
problem is

$$\begin{array}{ll} \underset{x}{\text{minimize}} & \sum_{i=1}^{n} x_i \log(x_i/c_i) \\ \text{subject to} & A^T x = b, \quad x \ge 0 \end{array} \tag{9.5.10}$$

where the constants $c_i$ are positive and $A$ is $n \times m$ with $m \ll n$. One possible method
of solution is to eliminate variables as described in Section 11.1. However since
$m \ll n$ this does not reduce the size of the problem very much and it is more advantageous
to consider solving the dual. For $c > 0$, the function $f(x) = x \log(x/c)$ is
convex on $x > 0$ since $f''(x) > 0$ and hence by continuity is convex on $x \ge 0$
($0 \log 0 = 0$). Thus (9.5.10) is a convex programming problem. Introducing multipliers
$\lambda$ and $\pi$ respectively, the condition $\nabla_x \mathcal{L} = 0$ in (9.5.1) becomes

$$\log(x_i/c_i) + 1 - e_i^T A\lambda - \pi_i = 0 \tag{9.5.11}$$

for $i = 1, 2, \ldots, n$. It follows that

$$x_i = c_i \exp(e_i^T A\lambda + \pi_i - 1) \tag{9.5.12}$$

and hence that $x_i > 0\ \forall\, i$. Thus the bounds $x \ge 0$ are inactive with $\pi^* = 0$ and can
be ignored. After relaxing the bounds $\lambda \ge 0$ to allow for equality constraints the
Wolfe dual (9.5.1) thus becomes

$$\underset{x,\lambda}{\text{maximize}} \quad \sum_i x_i \log(x_i/c_i) - \lambda^T(A^T x - b)$$

subject to (9.5.11). These constraints can be written as in (9.5.12) and used to
eliminate $x_i$ from the dual objective function. Thus the optimum multipliers $\lambda^*$
can be found by solving

$$\underset{\lambda}{\text{minimize}} \quad h(\lambda) \triangleq \sum_i c_i \exp(e_i^T A\lambda - 1) - \lambda^T b$$


without any constraints. This is a problem in only $m$ variables. It is easy to show
that

$$\nabla_\lambda h = A^T x - b, \qquad \nabla_\lambda^2 h = A^T [\operatorname{diag} x_i] A$$

where $x_i$ is dependent on $\lambda$ through (9.5.12). Since these derivatives are available,
and the Hessian is positive definite, Newton’s method with line search (Volume 1, 
Section 3.1) can be used to solve the problem. For large sparse primal problems 
Eriksson (1980) has also investigated the use of conjugate gradient methods. 
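The unconstrained dual can be sketched for the simplest case of a single equality constraint ($m = 1$), where $\lambda$ is a scalar and $A$ has one column with entries $a_i$. The code below is a minimal illustration under that assumption. The data, the function name and the step-halving safeguard (a crude substitute for the line search mentioned above) are hypothetical choices and are not taken from Eriksson (1980).

```python
import math

def max_entropy_dual_newton(a, b, c, lam=0.0, tol=1e-12, max_iter=100):
    """Minimize h(lam) = sum_i c_i*exp(a_i*lam - 1) - lam*b for m = 1.

    a holds the entries of the single column of A, so A^T x = sum_i a_i*x_i.
    Returns the multiplier lam and the primal vector x from (9.5.12).
    """
    def x_of(l):
        # (9.5.12) with pi = 0
        return [ci * math.exp(ai * l - 1.0) for ai, ci in zip(a, c)]

    def h(l):
        return sum(x_of(l)) - l * b

    for _ in range(max_iter):
        x = x_of(lam)
        grad = sum(ai * xi for ai, xi in zip(a, x)) - b        # A^T x - b
        hess = sum(ai * ai * xi for ai, xi in zip(a, x))       # A^T diag(x) A
        if abs(grad) < tol:
            break
        step, t, h0 = -grad / hess, 1.0, h(lam)
        # Step halving until h decreases sufficiently (small slack for rounding).
        while h(lam + t * step) > h0 + 1e-4 * t * grad * step + 1e-14 and t > 1e-8:
            t *= 0.5
        lam += t * step
    return lam, x_of(lam)

# Illustrative data: the single constraint x1 + x2 + x3 = 1 with c = (1, 2, 1).
lam, x = max_entropy_dual_newton(a=[1.0, 1.0, 1.0], b=1.0, c=[1.0, 2.0, 1.0])
print(lam, x)
```

For this data (9.5.12) gives $x_i = c_i \exp(\lambda - 1)$, so the constraint forces $\exp(\lambda^* - 1) = 1/4$ and the primal solution is proportional to $c$.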

An example of the use of duality to solve convex programming problems in 
which both the objective and constraint functions are non-linear arises in the study 
of geometric programming. A description of this technique, and of how the Wolfe 
dual enables the problem to be solved efficiently, is given in detail in Section 13.2. 

It is important to consider what happens if the primal has no solution. In particular
it is desirable that the dual also does not have a solution, and it is shown that
this is often but not always true. The primal problem can fail to have a solution in a
number of ways and firstly the case is considered in which the primal is unbounded
($f(x) \to -\infty$). A useful result is then as follows.


Theorem 9.5.2 
If $V$ is the infimum of $f(x)$ for feasible $x$ in the primal problem (9.4.8), and $v$ is the
supremum of $\mathcal{L}(x, \lambda)$ for feasible $x, \lambda$ in the dual, then $V \ge v$.

Proof

Let $x'$ be primal feasible and $x, \lambda$ dual feasible. Then by convexity of $f$, dual feasibility,
concavity of $c_i$, and non-negativity of $c_i'$ and $\lambda_i$ in turn, it follows that

$$f' - f \ge g^T(x' - x) = \sum_i \lambda_i a_i^T (x' - x) \ge \sum_i \lambda_i (c_i' - c_i) \ge -\sum_i \lambda_i c_i.$$

Hence $f' \ge f - \sum_i \lambda_i c_i = \mathcal{L}$, so taking the infimum over all $x'$ and the supremum
over all $x, \lambda$ it follows that $V \ge v$. □


If the primal problem is unbounded it follows that $V = v = -\infty$ and this is not
possible if any feasible $x, \lambda$ exists. Thus an unbounded primal implies an inconsistent
dual.
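The inequality $V \ge v$ of theorem 9.5.2 can be checked on a one-variable example. The sketch below assumes the hypothetical problem of minimizing $x^2$ subject to $x - 1 \ge 0$, for which dual feasibility ($2x - \lambda = 0$, $\lambda \ge 0$) gives $\lambda = 2x$ with $x \ge 0$. The sample points are arbitrary illustrative choices.

```python
def f(x):
    """Primal objective (convex)."""
    return x * x

def c(x):
    """Single constraint function, required to satisfy c(x) >= 0."""
    return x - 1.0

def lagrangian(x, lam):
    return f(x) - lam * c(x)

# Sample primal feasible points x' >= 1 and dual feasible pairs (x, lam = 2x).
primal_feasible = [1.0 + 0.25 * k for k in range(9)]
dual_feasible = [(0.25 * k, 0.5 * k) for k in range(9)]

# Smallest value of f(x') - L(x, lam) over all sampled combinations.
gap = min(f(xp) - lagrangian(x, lam)
          for xp in primal_feasible for (x, lam) in dual_feasible)
print(gap)
```

Every sampled difference is non-negative, and the smallest one is attained at $x' = x = 1$, $\lambda = 2$, which is the primal solution together with its multiplier.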

Next the case in which the primal constraints are inconsistent is considered. This
result is implied by theorem 9.5.2 if the dual is unbounded; however the converse
is not always true. An example of this is given in Question 9.24 in which although
the primal is inconsistent yet the dual has a solution. However for linear constraints
this possibility is excluded. In this case the constraints can be written $c(x) \triangleq
A^T x - b \ge 0$. Now the set $\{x : A^T x \ge b\}$ is empty if and only if there exists a
vector $\lambda \ge 0$ such that $A\lambda = 0$ and $b^T \lambda > 0$. (This result is similar to Farkas' lemma
(lemma 9.2.4) and is proved in a similar way.) Now let $x, \lambda$ be dual feasible and $\lambda'$
be the vector which exists above. Then $(x, \lambda + \alpha\lambda')$ is dual feasible for all $\alpha \ge 0$ and

$$\mathcal{L}(x, \lambda + \alpha\lambda') = \mathcal{L}(x, \lambda) + \alpha \lambda'^T b$$

which $\to \infty$ as $\alpha \to \infty$. Thus for linearly constrained problems, if the primal is infeasible
and the dual is feasible, then the dual is unbounded. It is also possible that both
the primal and the dual may be infeasible. Thus using the Wolfe dual for linear constraint
problems is always satisfactory in that a failure to solve the dual always
implies that the primal has no solution.
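The unbounded dual ray can be exhibited on the smallest possible inconsistent system. The following sketch assumes one variable with the contradictory constraints $x \ge 1$ and $-x \ge 0$, and a hypothetical linear objective $f(x) = x$; the numbers and names are illustrative only.

```python
A_row = [1.0, -1.0]     # A is 1 x 2: constraints x >= 1 and -x >= 0
b = [1.0, 0.0]
cost = 1.0              # objective f(x) = cost * x

def lagrangian(x, lam):
    """L(x, lam) = f(x) - lam^T (A^T x - b)."""
    return cost * x - sum(l * (ai * x - bi) for l, ai, bi in zip(lam, A_row, b))

lam = [1.0, 0.0]        # dual feasible: cost - A lam = 0 and lam >= 0
lam_ray = [1.0, 1.0]    # certificate: A lam' = 0 and b^T lam' > 0

a_lam_ray = sum(ai * r for ai, r in zip(A_row, lam_ray))     # A lam'
b_lam_ray = sum(bi * r for bi, r in zip(b, lam_ray))         # b^T lam'

# Dual objective along the ray lam + alpha*lam'; x is arbitrary (here 0.5).
values = [lagrangian(0.5, [l + alpha * r for l, r in zip(lam, lam_ray)])
          for alpha in (0.0, 1.0, 10.0, 100.0)]
print(a_lam_ray, b_lam_ray, values)
```

Along the ray the dual objective equals $\mathcal{L}(x, \lambda) + \alpha\, b^T \lambda'$ and therefore grows linearly in $\alpha$, independently of the choice of $x$.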

A final possibility which should also be considered (although it cannot occur for 
linear or quadratic programming problems) is that the primal (or dual) problem is 
bounded but has no solution. In this case there are open questions about the nature 
of solutions to the dual (or primal) problems and the situation is described in more 
detail by Wolfe (1961). 


Questions for Chapter 9 


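1. Verify that the points $x' = (1, 0, 0)^T$ and $x'' = (-\tfrac{1}{3}, \tfrac{2}{3}, \tfrac{2}{3})^T$ satisfy Kuhn-Tucker (first
order necessary) conditions for the problem

$$\begin{array}{ll} \text{minimize} & x_2 + x_3 \\ \text{subject to} & x_1 + x_2 + x_3 = 1 \\ & x_1^2 + x_2^2 + x_3^2 = 1 \end{array}$$

and evaluate the corresponding Lagrange multipliers.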


2. By drawing a diagram of the feasible region and the contours of $f(x)$ determine
the solution of the problem

$$\begin{array}{ll} \text{minimize} & f(x) \triangleq -x_1 + x_2 \\ \text{subject to} & 0 \le x_1 \le a \\ & 0 \le x_2 \le 1 \\ & x_2 \ge x_1^2 \end{array}$$

where $a$ is a fixed positive constant. Show that the set of active constraints at
the solution differs according to whether or not $a$ is greater than a certain fixed
value $\bar{a}$, and determine $\bar{a}$. Obtain the Lagrange multipliers of the active constraints
in both cases and verify that the KT conditions are satisfied.
3. A parcel has its longest side of length $x_1$ and its two other sides are of length
$x_2$ and $x_3$. Postage regulations are that each dimension should be no greater
than 42 in, and that the total girth (that is $2(x_2 + x_3)$) plus length should be no
greater than 72 in. State the constrained minimization problem which determines
the parcel of maximum volume which is permissible. Use the symmetry
between $x_2$ and $x_3$ to eliminate $x_3$, draw a diagram of the feasible region, and
show (approximately) how the objective function behaves. Identify two
possibilities for the set of active constraints at the solution. Solve these (as if
equality constraints) for $x$ and $\lambda$, and determine at which point the multipliers
satisfy KT conditions for inequality constraints.
4. It is desired to build a warehouse of width $x_1$, height $x_2$, and length $x_3$ (in
metres), with capacity 1500 m³. Building costs per square metre are: walls £4,
roof £6, floor plus land £12. For aesthetic reasons, the width should be twice
the height. State the problem which determines the dimensions of the warehouse
of minimum cost and write down the KT conditions. By eliminating $x_1$
and $x_3$, show that to the nearest metre, $x_2 = 10$ minimizes the cost, and hence
find $x_1$ and $x_3$. Determine the optimum multipliers in the KT conditions.

It can be shown that changing $c_i(x) = 0$ to $c_i(x) = \varepsilon_i$ in the problem induces
a change $\lambda_i \varepsilon_i$ (to first order) in $f(x)$ at the resulting solution. Estimate the
change in cost on reducing the required capacity by 10 per cent.
5. List all the stationary points of the function


Wei 4x5— 16x53; 
subject to the constraint c(x) = 0, where c(x) is given in turn by 


G) c&) =x, —4, 
(ii) e(x) =x 4x2 — 1, 
(iii) C(x) =X1X2%x3 —1. 


6. Solve the problem

$$\begin{array}{ll} \text{minimize} & f(x, y) \triangleq x^2 + y^2 + 3xy + 6x + 19y \\ \text{subject to} & 3y + x = 5. \end{array}$$


7. $a$, $b$, and $c$ are positive constants. Find the least value of a sum of three positive
numbers $x$, $y$, and $z$ subject to the constraint

$$\frac{a}{x} + \frac{b}{y} + \frac{c}{z} = 1$$

by the method of Lagrange multipliers, assuming that the positivity conditions
are not active.

8. Consider the problem

$$\begin{array}{ll} \underset{x}{\text{minimize}} & \tfrac{1}{2}\alpha(x_1 - 1)^2 - x_1 - x_2 \\ \text{subject to} & x_1 \ge x_2^2, \quad x_1^2 + x_2^2 \le 1. \end{array}$$

Find all the points at which both the constraints are active. One of these points
is $x' = (0.618, 0.786)^T$ to three decimal places. Working to this accuracy, find
the range of values of $\alpha$ for which $x'$ satisfies the KT conditions.


9. Consider the problem


maximize x 
subject to (3 —x,)? — (x2 —2)20 
3X ar 885) ZY). 
(i) Derive the KT conditions for this problem, and find all solutions of these. 
(ii) Solve the problem graphically. 


(iii) Repeat the analysis of (i) and (ii) for the same problem with the additional 
constraint 


2x, —3x220. 


10. Under what conditions on the problem are the KT conditions (a) necessary,
(b) sufficient, (c) necessary and sufficient for the solution of an inequality
constrained optimization problem?
Form the KT conditions for the problem

$$\begin{array}{ll} \text{maximize} & (x + 1)^2 + (y + 1)^2 \\ \text{subject to} & x^2 + y^2 \le 2 \\ & y \le 1 \end{array}$$

and hence determine the solution.
11. Find the point on the ellipse defined by the intersection of the surfaces
$x + y = 1$ and $x^2 + 2y^2 + z^2 = 1$ which is nearest to the origin. Use (i) the
method of Lagrange multipliers, (ii) direct elimination.
12. A bookmaker offers odds of $r_i : 1$ against each of $n$ runners in a race. A punter
bets a proportion $x_i$ of his total stake $t$ on each runner. Assume that only one
runner can win and that $r_1 > r_2 > \cdots > r_n > 0$. Clearly the punter can guarantee
not to lose money if and only if

$$r_i x_i \ge \sum_{j \ne i} x_j, \qquad i = 1, 2, \ldots, n.$$

Show that this situation can arise if $\sum_i 1/(r_i + 1) \le 1$. (Consider $x_i = c/(r_i + 1)$
where $1/c = \sum_i 1/(r_i + 1)$.)

If this condition holds, and if it is equally likely that any runner can win,
the expected profit to the punter is $\big(t \sum_i (r_i + 1)x_i\big) - t$. Show that the choice
of the $x_i$ which gives maximum expected profit, yet which guarantees no loss
of money, can be posed as a constrained minimization problem involving only
bounds on the variables and an equality constraint. By showing that the KT
conditions are satisfied, verify that the solution is

13. Consider the relationship between the method of Lagrange multipliers in
Section 9.1 and the direct elimination method in Section 7.2 and Question
7.5. In the latter notation, show that (9.1.3) can be rearranged to give
$\lambda^* = [A_1^*]^{-1} g_1^*$. Hence use the result of Question 7.5 to show that if $x^*$
satisfies (9.1.3) then it is a stationary point of the function W(x) in (7.2.2). 
14. Attempt to solve the problem in Question 7.4 by the method of Lagrange
multipliers. Show that either $y = 0$ or $\lambda = -1$ and both of these imply a contradiction,
so no solution of (9.1.5) exists. Explain this fact in terms of the
regularity assumption (9.2.4) and the independence of the vectors $a_i$, $i \in E$.
15. Show that if the matrix $A^*$ in (9.1.3) has full rank then the Lagrange multipliers
$\lambda^*$ are uniquely defined by $\lambda^* = A^{*+} g^*$ or by solving any $m \times m$
subsystem $A_1^* \lambda^* = g_1^*$ where $A_1^*$ is non-singular (see Question 9.13).
Computationally the former is most stable if the matrix factors $A^* = QR$
($Q$ orthogonal, $R$ upper triangular) are calculated and $\lambda^*$ is obtained by back
substitution in $R\lambda^* = Q^T g^*$.

16. By examining second order conditions, determine whether or not each of the
points $x'$ and $x''$ are local solutions to the problem in Question 9.1.

17. By examining second order conditions, determine the nature (maximizer,
minimizer, or saddle point) of each of the stationary points obtained in
Question 9.5.

18. Given an optimal b.f.s. to an LP, show that the reduced costs $\hat{c}_N$ are the
Lagrange multipliers to the LP after having used the equations $Ax = b$ to
eliminate the basic variables.

19. For the LP in Question 8.1, illustrate (for non-negative $x$) the plane
$3x_1 + x_2 + x_3 = 12$ and the line along which it intersects the plane
$x_1 - x_2 + x_3 = -8$. Hence show that the feasible region is a convex set and
give its extreme points. Do the same for the LP which results from deleting
the condition $x_1 - x_2 + x_3 = -8$.

20. For the LP (8.1.1) in standard form, prove that

(i) $\exists$ a solution $\Rightarrow$ $\exists$ an extreme point which is a solution,
(ii) $x$ is an extreme point $\Rightarrow$ $x$ has $p \le m$ positive components,
(iii) $x$ is an extreme point and $A$ has full rank
$\Rightarrow$ $\exists$ a b.f.s. at $x$ (degenerate iff $p < m$),
(iv) $\exists$ a b.f.s. at $x$ $\Rightarrow$ $x$ is an extreme point.

(By a b.f.s. at $x$ is meant that there exists a partition into $B$ and $N$ variables
such that $x_B \ge 0$, $x_N = 0$, and $A_B$ is non-singular (see Section 8.2).) Part (i)
follows from the convexity of $S$ in theorem 9.4.1, the result of Question 9.21,
and the fact that an extreme point of $S$ must be an extreme point of (8.1.1) by
the linearity of $f$. The remaining results follow from Question 9.22. Part (ii) is
true since if $A_P$ has rank $p$ then $p \le m$. Part (iii) is derived by including all
positive components of $x$ in $B$. If $p < m$, other independent columns of $A$ exist
which can augment $A_P$ to become a square non-singular matrix, and the corresponding
variables (with zero value, hence degeneracy) are added to $B$. In part
(iv) the non-singularity of $A_B$ implies that $A_P$ has full rank and hence $x$ is
extreme.

21. Consider a closed convex set $K \subset \mathbb{R}^n_+ = \{x : x \ge 0\}$. Show that $K$ has an extreme
point in any supporting hyperplane $S = \{x : c^T x = z\}$. Show that
$T \triangleq K \cap S$ is not empty. Since $T \subset \mathbb{R}^n_+$, construct an extreme point in the
following way. Choose the set $T_1 \subset T$ such that the first component $t_1$ of all
vectors $t \in T_1$ is smallest. Then choose a set $T_2 \subset T_1$ with the smallest second
component, and so on. Show that a unique point $t^*$ is ultimately determined
which is extreme in $T$. Hence show that $t^*$ is extreme in $K$ (Hadley, 1962,
chapter 2, theorem III).

22. Let $x$ be any feasible point in (8.1.1), with $p$ positive components, let $A_P$ be
the matrix with columns $a_i\ \forall\, i : x_i > 0$ and $x_P$ likewise. Show that $x$ is an
extreme point of (8.1.1) iff $A_P$ has full rank, in the following way. If $A_P$ has
not full rank there exists $u \ne 0$ such that $A_P u = 0$. By examining $x_P \pm \varepsilon u$ show
that $x$ is not extreme. If $A_P$ has full rank, show that $x_P = A_P^+ b$ is uniquely
defined. Let $x = (1-\theta)x_0 + \theta x_1$, $\theta \in (0, 1)$, $x_0, x_1$ feasible, and show that
the zero components of $x$ must be zero in $x_0$ and $x_1$. Hence show that the
remaining components are defined by $A_P^+ b$ and hence $x_0 = x_1 = x$, so that
$x$ is extreme.





23. Show that the dual of the problem

$$\begin{aligned}
\text{minimize} \quad & \tfrac{1}{2}\sigma x_1^2 + \tfrac{1}{2}x_2^2 + x_1 \\
\text{subject to} \quad & x_1 \ge 0
\end{aligned}$$

is a maximization problem in terms of a Lagrange multiplier $\lambda$. For the cases
$\sigma = +1$ and $\sigma = -1$, investigate whether the local solution of the dual gives the
multiplier $\lambda^*$ which exists at the local solution to the primal, and explain the
difference between the two cases.

24. Consider the problem

$$\begin{aligned}
\text{minimize} \quad & f(x) \triangleq 0 \\
\text{subject to} \quad & c(x) \triangleq -e^x \ge 0.
\end{aligned}$$

Verify that the constraint is concave but inconsistent, so that the feasible
region is empty. Set up the dual problem and show that it is solved by $\lambda = 0$
and any $x$.


Chapter 10


Quadratic Programming 


10.1 Equality Constraints 


Like linear programming problems, another optimization problem which can be
solved in a finite number of steps is a quadratic programming (QP) problem. In
terms of (7.1.1) this is a problem in which the objective function $f(x)$ is quadratic
and the constraint functions $c_i(x)$ are linear. Thus the problem is to find a solution
$x^*$ to

$$\begin{aligned}
\underset{x}{\text{minimize}} \quad & q(x) \triangleq \tfrac{1}{2}x^TGx + g^Tx \\
\text{subject to} \quad & a_i^Tx = b_i, \quad i \in E, \\
& a_i^Tx \ge b_i, \quad i \in I,
\end{aligned} \tag{10.1.1}$$


where it is always possible to arrange that the matrix G is symmetric. As in linear 
programming, the problem may be infeasible or the solution may be unbounded; 
however these possibilities are readily detected in the algorithms, so for the most 
part it is assumed that a solution $x^*$ exists. If the Hessian matrix $G$ is positive
semi-definite, $x^*$ is a global solution, and if $G$ is positive definite, $x^*$ is also unique.
These results follow from the (strict) convexity of $q(x)$, so that (10.1.1) is a
convex programming problem and theorem 9.4.1 and its corollary apply. When the 
Hessian G is indefinite then local solutions which are not global can occur, and the 
computation of any such local solution is of interest (see Section 10.4). A modern 
computer code for QP needs to be quite sophisticated, so a simplified account of 
the basic structure is given first, and is amplified or qualified later. In the first case 
it is shown in Sections 10.1 and 10.2 how equality constraint problems can be 
treated. The generalization of these ideas to handle inequality constraints is by 
means of an active set strategy and is described in Section 10.3. More advanced 
features of a QP algorithm which handle an indefinite Hessian and allow a sequence 
of problems to be solved efficiently are given in Section 10.4. QP problems with a 
special structure such as having only bounded constraints, or such as arise from 
least squares problems, are considered in Section 10.5. Early work on QP was often 
presented as a modification of the tableau form of the simplex method for linear 
programming. This work is described in Section 10.6 and has come to be expressed 
in terms of a linear complementarity problem. The equivalence between these
methods and the active set methods is described and reasons for preferring the
latter derivation are put forward. 

Quadratic programming differs from linear programming in that it is possible to 
have meaningful problems (other than an n x n system of equations) in which there 
are no inequality constraints. This section therefore studies how to find a solution 
x* to the equality constraint problem 





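$$\begin{aligned}
\underset{x}{\text{minimize}} \quad & q(x) \triangleq \tfrac{1}{2}x^TGx + g^Tx \\
\text{subject to} \quad & A^Tx = b.
\end{aligned} \tag{10.1.2}$$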


It is assumed that there are $m \le n$ constraints so that $b \in \mathbb{R}^m$, and $A$ is $n \times m$ and
collects the column vectors $a_i$, $i \in E$, in (10.1.1). It is assumed that $A$ has rank $m$;
if the constraints are consistent this can always be achieved by removing dependent
constraints, although there may be numerical difficulties in recognizing this
situation. This assumption also ensures that unique Lagrange multipliers $\lambda^*$ exist, and
calculation of these quantities, which may be required for sensitivity analysis or for 
use in an active set method, is also considered in this section. 

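A straightforward way of solving (10.1.2) is to use the constraints to eliminate
variables. If the partitions

$$x = \begin{pmatrix} x_1 \\ x_2 \end{pmatrix}, \quad
A = \begin{pmatrix} A_1 \\ A_2 \end{pmatrix}, \quad
g = \begin{pmatrix} g_1 \\ g_2 \end{pmatrix}, \quad
G = \begin{pmatrix} G_{11} & G_{12} \\ G_{21} & G_{22} \end{pmatrix}$$

are defined, where $x_1 \in \mathbb{R}^m$ and $x_2 \in \mathbb{R}^{n-m}$, etc., then the equations in (10.1.2)
become $A_1^Tx_1 + A_2^Tx_2 = b$ and are readily solved (by Gaussian elimination, say) to
give $x_1$ in terms of $x_2$; this is conveniently written

$$x_1 = A_1^{-T}(b - A_2^Tx_2). \tag{10.1.3}$$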


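Substituting into $q(x)$ gives the problem: minimize $\psi(x_2)$, $x_2 \in \mathbb{R}^{n-m}$, where $\psi(x_2)$
is the quadratic function

$$\begin{aligned}
\psi(x_2) = {} & \tfrac{1}{2}x_2^T\left(G_{22} - G_{21}A_1^{-T}A_2^T - A_2A_1^{-1}G_{12} + A_2A_1^{-1}G_{11}A_1^{-T}A_2^T\right)x_2 \\
& + x_2^T\left(G_{21} - A_2A_1^{-1}G_{11}\right)A_1^{-T}b + \tfrac{1}{2}b^TA_1^{-1}G_{11}A_1^{-T}b \\
& + x_2^T\left(g_2 - A_2A_1^{-1}g_1\right) + g_1^TA_1^{-T}b.
\end{aligned} \tag{10.1.4}$$

A unique minimizer $x_2^*$ exists if the Hessian $\nabla^2\psi$ in the quadratic term is positive
definite, in which case $x_2^*$ is obtained by solving the linear system $\nabla\psi(x_2) = 0$.
Then $x_1^*$ is found by substitution in (10.1.3). The Lagrange multiplier vector $\lambda^*$ is
defined by $g^* = A\lambda^*$ (Section 9.1) where $g^* = \nabla q(x^*)$, and can be calculated by
solving the first partition $g_1^* = A_1\lambda^*$. By definition of $q(x)$ in (10.1.2), $g^* = g + Gx^*$
so an explicit expression for $\lambda^*$ is

$$\lambda^* = A_1^{-1}(g_1 + G_{11}x_1^* + G_{12}x_2^*). \tag{10.1.5}$$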


An example of the method is given to solve the problem 


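$$\begin{aligned}
\underset{x}{\text{minimize}} \quad & q(x) \triangleq x_1^2 + x_2^2 + x_3^2 \\
\text{subject to} \quad & x_1 + 2x_2 - x_3 = 4 \\
& x_1 - x_2 + x_3 = -2.
\end{aligned} \tag{10.1.6}$$

To eliminate $x_3$, the equations are written

$$\begin{aligned}
x_1 + 2x_2 &= 4 + x_3 \\
x_1 - x_2 &= -2 - x_3
\end{aligned}$$

and are readily solved by Gaussian elimination giving

$$x_1 = -\tfrac{1}{3}x_3, \qquad x_2 = 2 + \tfrac{2}{3}x_3. \tag{10.1.7}$$

It is easily verified that this solution corresponds to (10.1.3) with the partitions
$\mathbf{x}_1 = \begin{pmatrix} x_1 \\ x_2 \end{pmatrix}$, $\mathbf{x}_2 = (x_3)$,
$A_1 = \begin{bmatrix} 1 & 1 \\ 2 & -1 \end{bmatrix}$, and $A_2 = [-1 \;\; 1]$.
Substituting (10.1.7) into $q(x)$ in (10.1.6) gives

$$\psi(x_3) = \tfrac{14}{9}x_3^2 + \tfrac{8}{3}x_3 + 4 \tag{10.1.8}$$

which corresponds to (10.1.4) with $G_{11} = \begin{bmatrix} 2 & 0 \\ 0 & 2 \end{bmatrix}$, $G_{12} = G_{21}^T = 0$, $G_{22} = [2]$,
and $g = 0$. The Hessian matrix in (10.1.8) is $[\tfrac{28}{9}]$ which is positive definite, so the
minimizer is obtained by setting $\nabla\psi = 0$ and is $x_3^* = -\tfrac{6}{7}$. Back substitution in
(10.1.6) gives $x_1^* = \tfrac{2}{7}$ and $x_2^* = \tfrac{10}{7}$. The system $g^* = A\lambda^*$ becomes

$$\frac{2}{7}\begin{pmatrix} 2 \\ 10 \\ -6 \end{pmatrix} =
\begin{bmatrix} 1 & 1 \\ 2 & -1 \\ -1 & 1 \end{bmatrix}
\begin{pmatrix} \lambda_1^* \\ \lambda_2^* \end{pmatrix}$$

and solving the first two rows gives $\lambda_1^* = \tfrac{8}{7}$ and $\lambda_2^* = -\tfrac{4}{7}$, and this is consistent with
the third row.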
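The arithmetic of this example is easy to check mechanically. The following is a minimal sketch in Python using exact rational arithmetic from the standard `fractions` module; the helper names `eliminate` and `psi` are illustrative and do not come from the text.

```python
from fractions import Fraction as F

def eliminate(x3):
    """Equation (10.1.7): x1 and x2 expressed in terms of x3."""
    return -x3 / 3, 2 + 2 * x3 / 3

def psi(x3):
    """Reduced objective: q(x) of (10.1.6) restricted to the constraints."""
    x1, x2 = eliminate(x3)
    return x1 ** 2 + x2 ** 2 + x3 ** 2

# psi is a quadratic c2*t^2 + c1*t + c0; recover its coefficients exactly
# from three samples (this should reproduce (10.1.8)).
c0 = psi(F(0))
c2 = (psi(F(1)) + psi(F(-1)) - 2 * c0) / 2
c1 = (psi(F(1)) - psi(F(-1))) / 2

x3 = -c1 / (2 * c2)          # stationary point of psi
x1, x2 = eliminate(x3)

# Multipliers from the first two rows of g* = A lambda*, with g* = 2 x*:
#   2*x1 = lam1 + lam2,   2*x2 = 2*lam1 - lam2
lam1 = (2 * x1 + 2 * x2) / 3
lam2 = 2 * x1 - lam1

print(c2, c1, c0)
print(x1, x2, x3, lam1, lam2)
```

The third row of $g^* = A\lambda^*$ is not used to compute the multipliers, so checking it afterwards is a cheap consistency test.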

Direct elimination of variables is not the only way of solving (10.1.2) nor may it
be the best. A generalized elimination method is possible in which essentially a
linear transformation of variables is made initially. Let $S$ and $Z$ be $n \times m$ and
$n \times (n - m)$ matrices respectively such that $[S : Z]$ is non-singular, and in addition
let $A^TS = I$ and $A^TZ = 0$. $S^T$ can be regarded as a left generalized inverse for $A$ so
that a solution of $A^Tx = b$ is given by $x = Sb$. However this solution is non-unique
in general and other feasible points are given by $x = Sb + \delta$ where $\delta$ is in the null
column space of $A$. This is the linear space

$$\{\delta : A^T\delta = 0\} \tag{10.1.9}$$

which has dimension $n - m$. The purpose of the matrix $Z$ is that its columns
$z_1, z_2, \ldots, z_{n-m}$ act as basis vectors (or reduced coordinate directions) for this
null space. That is to say, at any feasible point $x$ any feasible correction $\delta$ can be
written as

$$\delta = Zy = \sum_{i=1}^{n-m} z_iy_i \tag{10.1.10}$$

where $y_1, y_2, \ldots, y_{n-m}$ are the components (or reduced variables) in each
reduced coordinate direction (see Figure 10.1.1). Thus any feasible point $x$ can be
written

$$x = Sb + Zy. \tag{10.1.11}$$

Figure 10.1.1 Reduced coordinates for the feasible region


This can be interpreted (Figure 10.1.2) as a step from the origin to the feasible
point $Sb$, followed by a feasible correction $Zy$ to reach the point $x$. Thus (10.1.11)
provides a way of eliminating the constraints $A^Tx = b$ in terms of the vector of
reduced variables $y$ which has $n - m$ elements, and is therefore a generalization of
(10.1.3). Substituting into $q(x)$ gives the reduced quadratic function

$$\psi(y) = \tfrac{1}{2}y^TZ^TGZy + (g + GSb)^TZy + \tfrac{1}{2}(2g + GSb)^TSb. \tag{10.1.12}$$

If $Z^TGZ$ is positive definite then a unique minimizer $y^*$ exists which (from
$\nabla\psi(y) = 0$) solves the linear system

$$(Z^TGZ)y = -Z^T(g + GSb). \tag{10.1.13}$$

The solution is best achieved by computing $LL^T$ or $LDL^T$ factors of $Z^TGZ$ which
also enables the positive definite condition to be checked. Then $x^*$ is obtained by
substituting into (10.1.11). The matrix $Z^TGZ$ in (10.1.12) is often referred to as
the reduced Hessian matrix and the vector $Z^T(g + GSb)$ as the reduced gradient
vector. Notice that $g + GSb = \nabla q(Sb)$ is the gradient vector of $q(x)$ at $x = Sb$
(just as $g$ is $\nabla q(0)$), so reduced derivatives are obtained by a matrix operation
with $Z^T$. In addition, premultiplying by $S^T$ in $g^* = A\lambda^*$ gives the equation

$$\lambda^* = S^Tg^* \tag{10.1.14}$$


which can be used to calculate Lagrange multipliers. Explicit expressions for $x^*$ and
$\lambda^*$ in terms of the original data are

$$x^* = Sb - Z(Z^TGZ)^{-1}Z^T(g + GSb) \tag{10.1.15}$$

and

$$\begin{aligned}
\lambda^* &= S^T(g + Gx^*) \\
&= \left(S^T - S^TGZ(Z^TGZ)^{-1}Z^T\right)g + S^T\left(G - GZ(Z^TGZ)^{-1}Z^TG\right)Sb.
\end{aligned} \tag{10.1.16}$$


These formulae would not be used directly in computation but are useful in show- 
ing the relationship with the alternative method of Lagrange multipliers for solving 
equality constraint QP problems, considered in Section 10.2. 
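The generalized elimination steps (10.1.11) to (10.1.14) can be sketched in a few lines. The code below is illustrative only: it uses exact rational arithmetic and a plain Gauss-Jordan solve for the reduced system rather than the $LL^T$ or $LDL^T$ factors recommended in the text, and the function names are assumptions. As data it takes problem (10.1.6) with the $S$ and $Z$ read off from the direct elimination formula (10.1.7), so that $Sb = (0, 2, 0)^T$ and $Z = (-\tfrac{1}{3}, \tfrac{2}{3}, 1)^T$.

```python
from fractions import Fraction as F

def solve(M, rhs):
    """Solve M z = rhs exactly by Gauss-Jordan elimination (M square, non-singular)."""
    n = len(M)
    aug = [[F(v) for v in row] + [F(r)] for row, r in zip(M, rhs)]
    for c in range(n):
        p = next(r for r in range(c, n) if aug[r][c] != 0)   # first non-zero pivot
        aug[c], aug[p] = aug[p], aug[c]
        aug[c] = [v / aug[c][c] for v in aug[c]]
        for r in range(n):
            if r != c:
                f = aug[r][c]
                aug[r] = [a - f * t for a, t in zip(aug[r], aug[c])]
    return [row[n] for row in aug]

def matvec(M, v):
    return [sum(a * t for a, t in zip(row, v)) for row in M]

def transpose(M):
    return [list(col) for col in zip(*M)]

def generalized_elimination(G, g, S, Z, b):
    """Solve (10.1.2) through x = Sb + Zy, following (10.1.11)-(10.1.14)."""
    Zt, St = transpose(Z), transpose(S)
    Sb = matvec(S, b)
    grad_Sb = [gi + hi for gi, hi in zip(g, matvec(G, Sb))]      # g + G S b
    GZ_cols = [matvec(G, col) for col in Zt]                     # columns of G Z
    reduced_hessian = [[sum(a * t for a, t in zip(zi, gz)) for gz in GZ_cols]
                       for zi in Zt]                             # Z^T G Z
    reduced_gradient = matvec(Zt, grad_Sb)                       # Z^T (g + G S b)
    y = solve(reduced_hessian, [-r for r in reduced_gradient])   # (10.1.13)
    x = [s + z for s, z in zip(Sb, matvec(Z, y))]                # (10.1.11)
    g_star = [gi + hi for gi, hi in zip(g, matvec(G, x))]
    lam = matvec(St, g_star)                                     # (10.1.14)
    return x, lam

G = [[2, 0, 0], [0, 2, 0], [0, 0, 2]]
g = [0, 0, 0]
b = [4, -2]
S = [[F(1, 3), F(2, 3)], [F(1, 3), F(-1, 3)], [0, 0]]
Z = [[F(-1, 3)], [F(2, 3)], [1]]

x, lam = generalized_elimination(G, g, S, Z, b)
print(x, lam)
```

With this choice of $S$ and $Z$ the reduced Hessian is $[\tfrac{28}{9}]$, the same matrix that appears in (10.1.8), which illustrates that direct elimination is one instance of the general scheme.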

Depending on the choice of S and Z, a number of methods exist which can be 
interpreted in this way, and a general procedure is described below for construct- 
ing matrices S and Z with the correct properties. However one choice of particular 
importance is obtained by way of any QR factorization (Householder’s, for 
example) of the matrix A. This can be written 


$$A = Q\begin{bmatrix} R \\ 0 \end{bmatrix} = [Q_1 \;\; Q_2]\begin{bmatrix} R \\ 0 \end{bmatrix} = Q_1R \tag{10.1.17}$$

where $Q$ is $n \times n$ and orthogonal, $R$ is $m \times m$ and upper triangular, and $Q_1$ and $Q_2$
are $n \times m$ and $n \times (n - m)$ respectively. The choices

$$S = A^{+T} = Q_1R^{-T}, \qquad Z = Q_2 \tag{10.1.18}$$

are readily observed to have the correct properties. Moreover the vector $Sb$ in
(10.1.11) and Figure 10.1.2 is observed to be orthogonal to the constraint
manifold, and the reduced coordinate directions $z_i$ are also mutually orthogonal. The
solution $x^*$ is obtained as above by setting up and solving (10.1.13) for $y^*$ and then
substituting into (10.1.11). $Sb$ is calculated by forward substitution in $R^Tv = b$
followed by forming $Sb = Q_1v$. The multipliers $\lambda^*$ in (10.1.14) are calculated by
backward substitution in

$$R\lambda^* = Q_1^Tg^*. \tag{10.1.19}$$

Figure 10.1.2 Generalized elimination in the special case (10.1.18)

This scheme is due to Gill and Murray (1974a) and I shall refer to it as the
orthogonal factorization method. It is recommended for non-sparse problems because it
is stable in regard to propagation of round-off errors (see below).

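An example is given of the method applied to problem (10.1.6). The matrices
involved are

$$S = \frac{1}{14}\begin{bmatrix} 5 & 8 \\ 4 & -2 \\ -1 & 4 \end{bmatrix}, \qquad
Z = \begin{bmatrix} 1 \\ -2 \\ -3 \end{bmatrix}$$

(unnormalized) and these can be represented by the QR decomposition

$$A = \begin{bmatrix}
1/\sqrt{6} & 4/\sqrt{21} & 1/\sqrt{14} \\
2/\sqrt{6} & -1/\sqrt{21} & -2/\sqrt{14} \\
-1/\sqrt{6} & 2/\sqrt{21} & -3/\sqrt{14}
\end{bmatrix}
\begin{bmatrix}
\sqrt{6} & -\sqrt{6}/3 \\
0 & \sqrt{21}/3 \\
0 & 0
\end{bmatrix}.$$

The vector $Sb = \tfrac{1}{7}(2, 10, -6)^T$, and since $g = 0$ and $GSb = 2Sb$ it follows that
$Z^T(g + GSb) = 0$. Hence $y^* = 0$ and $x^* = Sb = \tfrac{1}{7}(2, 10, -6)^T$. Also $g^* = g + GSb =
\tfrac{2}{7}(2, 10, -6)^T$ so $\lambda^* = \tfrac{1}{7}(8, -4)^T$ and these results agree with those obtained by
direct elimination.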
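The orthogonal factorization method can also be tried numerically on problem (10.1.6). What follows is a rough floating-point sketch, not production code: the Householder QR routine is a bare textbook version, the reduced system is solved by Gaussian elimination rather than by $LL^T$ factors, and all function names are assumptions. The signs of the computed $Q$ and $R$ may differ from those displayed above, but $x^*$ and $\lambda^*$ do not depend on that choice.

```python
import math

def householder_qr(A):
    """Return (Q, R) with A = Q R, Q (n x n) orthogonal and R (n x m) upper triangular."""
    n, m = len(A), len(A[0])
    R = [[float(v) for v in row] for row in A]
    Q = [[float(i == j) for j in range(n)] for i in range(n)]
    for k in range(m):
        v = [R[i][k] for i in range(k, n)]
        norm = math.sqrt(sum(t * t for t in v))
        if norm == 0.0:
            continue
        v[0] += math.copysign(norm, v[0])        # Householder vector
        vv = sum(t * t for t in v)
        for j in range(m):                       # R <- H R
            d = 2.0 * sum(v[i] * R[k + i][j] for i in range(n - k)) / vv
            for i in range(n - k):
                R[k + i][j] -= d * v[i]
        for r in range(n):                       # Q <- Q H
            d = 2.0 * sum(Q[r][k + i] * v[i] for i in range(n - k)) / vv
            for i in range(n - k):
                Q[r][k + i] -= d * v[i]
    return Q, R

def solve(M, rhs):
    """Gaussian elimination with partial pivoting (floating point)."""
    n = len(M)
    aug = [[float(v) for v in row] + [float(r)] for row, r in zip(M, rhs)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(aug[r][c]))
        aug[c], aug[p] = aug[p], aug[c]
        for r in range(c + 1, n):
            f = aug[r][c] / aug[c][c]
            aug[r] = [a - f * t for a, t in zip(aug[r], aug[c])]
    z = [0.0] * n
    for r in reversed(range(n)):
        z[r] = (aug[r][n] - sum(aug[r][j] * z[j] for j in range(r + 1, n))) / aug[r][r]
    return z

def orthogonal_factorization_qp(G, g, A, b):
    """Solve (10.1.2) with S = Q1 R^{-T}, Z = Q2 as in (10.1.18)."""
    n, m = len(A), len(A[0])
    k = n - m
    Q, R = householder_qr(A)
    Q1 = [row[:m] for row in Q]
    Z = [row[m:] for row in Q]
    v = [0.0] * m                                # forward substitution R^T v = b
    for i in range(m):
        v[i] = (b[i] - sum(R[j][i] * v[j] for j in range(i))) / R[i][i]
    Sb = [sum(Q1[r][j] * v[j] for j in range(m)) for r in range(n)]
    grad = [g[r] + sum(G[r][c] * Sb[c] for c in range(n)) for r in range(n)]
    GZ = [[sum(G[r][c] * Z[c][j] for c in range(n)) for j in range(k)] for r in range(n)]
    red_hess = [[sum(Z[r][i] * GZ[r][j] for r in range(n)) for j in range(k)] for i in range(k)]
    red_grad = [sum(Z[r][i] * grad[r] for r in range(n)) for i in range(k)]
    y = solve(red_hess, [-t for t in red_grad])  # (10.1.13)
    x = [Sb[r] + sum(Z[r][j] * y[j] for j in range(k)) for r in range(n)]
    g_star = [g[r] + sum(G[r][c] * x[c] for c in range(n)) for r in range(n)]
    rhs = [sum(Q1[r][i] * g_star[r] for r in range(n)) for i in range(m)]
    lam = [0.0] * m                              # back substitution (10.1.19)
    for i in reversed(range(m)):
        lam[i] = (rhs[i] - sum(R[i][j] * lam[j] for j in range(i + 1, m))) / R[i][i]
    return x, lam

G = [[2, 0, 0], [0, 2, 0], [0, 0, 2]]
A = [[1, 1], [2, -1], [-1, 1]]
x, lam = orthogonal_factorization_qp(G, [0, 0, 0], A, [4, -2])
print(x, lam)
```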

A general scheme for computing suitable $S$ and $Z$ matrices is the following.
Choose any $n \times (n - m)$ matrix $V$ such that $[A : V]$ is non-singular. Let the inverse
be expressed in partitioned form

$$[A : V]^{-1} = \begin{bmatrix} S^T \\ Z^T \end{bmatrix} \tag{10.1.20}$$

where $S$ and $Z$ are $n \times m$ and $n \times (n - m)$ respectively. Then it follows that $S^TA = I$
and $Z^TA = 0$ so these matrices are suitable for use in the generalized elimination
method. The resulting method can also be interpreted as one which makes a
linear transformation with the matrix $[A : V]$, as described at the start of Section
12.4. The methods described earlier can all be identified as special cases of this
scheme. If

$$V = \begin{bmatrix} 0 \\ I \end{bmatrix} \tag{10.1.21}$$

is chosen, then the identity

$$\begin{bmatrix} A_1 & 0 \\ A_2 & I \end{bmatrix}^{-1} =
\begin{bmatrix} A_1^{-1} & 0 \\ -A_2A_1^{-1} & I \end{bmatrix} =
\begin{bmatrix} S^T \\ Z^T \end{bmatrix} \tag{10.1.22}$$

gives expressions for $S$ and $Z$. It can easily be verified that in this case the method
reduces to the direct elimination method. Alternatively if the choice

$$V = Q_2 \tag{10.1.23}$$

is made, where $Q_2$ is defined by (10.1.17), then the above orthogonal factorization
method is obtained. This follows by virtue of the identity

$$[A : V]^{-1} = [Q_1R : Q_2]^{-1} = \begin{bmatrix} R^{-1}Q_1^T \\ Q_2^T \end{bmatrix} \tag{10.1.24}$$

which can also be expressed as

$$[A : V]^{-1} = \begin{bmatrix} A^+ \\ V^T \end{bmatrix} \tag{10.1.25}$$

where $A^+ = (A^TA)^{-1}A^T$ is the full rank Penrose generalized inverse matrix, and
(10.1.14) can be written

$$\lambda^* = A^+g^* \tag{10.1.26}$$


in this case. In fact for any given S and Z, a unique V must exist by virtue of 
(10.1.20). However if Z is given but S is arbitrary then some freedom is possible in 
the choice of V; this point is discussed further in Question 10.6. The generalization 
expressed by (10.1.20) is not entirely an academic one; it may be preferable not to 
form S and Z but to perform any convenient stable factorization of [A : V] and 
use this to generate S and Z indirectly: in particular the simple form of (10.1.21) 
and (10.1.22) could be advantageous for larger problems or for large sparse 
problems. 
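The construction (10.1.20) is simple to experiment with. The sketch below, in exact rational arithmetic with assumed function names, forms $[A : V]^{-1}$ for the constraint matrix of (10.1.6) and reads off $S$ and $Z$ for two choices of $V$: the choice (10.1.21), and the vector $(1, -2, -3)^T$, which is an unnormalized multiple of $Q_2$ for this example, so $Z$ comes out scaled by $\tfrac{1}{14}$.

```python
from fractions import Fraction as F

def solve(M, rhs):
    """Solve M z = rhs exactly by Gauss-Jordan elimination (M square, non-singular)."""
    n = len(M)
    aug = [[F(v) for v in row] + [F(r)] for row, r in zip(M, rhs)]
    for c in range(n):
        p = next(r for r in range(c, n) if aug[r][c] != 0)
        aug[c], aug[p] = aug[p], aug[c]
        aug[c] = [v / aug[c][c] for v in aug[c]]
        for r in range(n):
            if r != c:
                f = aug[r][c]
                aug[r] = [a - f * t for a, t in zip(aug[r], aug[c])]
    return [row[n] for row in aug]

def elimination_matrices(A, V):
    """Return (S, Z) from the partitioned inverse (10.1.20) of [A : V]."""
    n, m = len(A), len(A[0])
    AV = [list(ra) + list(rv) for ra, rv in zip(A, V)]
    inv_cols = [solve(AV, [int(i == j) for i in range(n)]) for j in range(n)]
    inv_rows = [list(r) for r in zip(*inv_cols)]
    S = [list(c) for c in zip(*inv_rows[:m])]    # S^T is the first m rows
    Z = [list(c) for c in zip(*inv_rows[m:])]    # Z^T is the remaining n - m rows
    return S, Z

A = [[1, 1], [2, -1], [-1, 1]]

S_direct, Z_direct = elimination_matrices(A, [[0], [0], [1]])     # (10.1.21)
S_orth, Z_orth = elimination_matrices(A, [[1], [-2], [-3]])       # multiple of Q2

print(S_direct, Z_direct)
print(S_orth, Z_orth)
```

The first choice reproduces the coefficients of (10.1.7); the second reproduces the matrix $S$ of the orthogonal factorization example.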

Another method which can be described within this framework is the reduced 
gradient method of Wolfe (1963a). In terms of the active set method (Section 10.3) 
the matrix $V$ is formed from normal vectors $a_i$ of inactive constraints which have
previously been active. Thus when a constraint becomes inactive the column $a_i$ in
$A$ is transferred to $V$ so that $[A : V]^{-1}$ need not be recomputed and only the
partition line is repositioned. When an inactive constraint becomes active, the
incoming vector $a_i$ replaces one column of $V$. The choice of column is arbitrary and
can be made so as to keep [A : V] well conditioned. This exchange of columns is 
analogous to that in linear programming (see equation (8.2.14)) and developments 
such as those described in Section 8.5 for taking account of sparsity or maintaining 
stability can be taken over into this method. 

Other methods for QP also can be interpreted in these terms. Murray (1971)
gives a method in which a matrix $Z$ is constructed for which

$$Z^TGZ = D \tag{10.1.27}$$

where $D$ is diagonal. In this method columns of $Z$ can be interpreted as conjugate
directions, and are related to the matrix ($\bar{Z}$ say) in the orthogonal factorization
method by $Z = \bar{Z}L^{-T}$, where $\bar{Z}^TG\bar{Z}$ has factors $LDL^T$. Because of this the method
essentially represents inverse matrix information and may be doubtful on grounds
of stability. However it does have the advantage that the solution of (10.1.13)
becomes trivial and no storage of factors of $Z^TGZ$ is required. It is also possible to
consider Beale's (1959) method (see Section 10.6) partly within this framework
since it can be interpreted as selecting conjugate search directions from the rows
of $[A : V]^{-1}$. In this case the matrix $V$ is formed from gradient differences $\gamma^{(k)}$
on previous iterations. However the method also has some disadvantageous features.
It is not easy to give a unique best choice amongst these many methods, because
this depends on the type of problem and also on whether the method is to be
considered as part of an active set method for inequality constraints (Section 10.3).
Moreover there are other methods described in Section 10.2, some of which are of
interest. However the orthogonal factorization method is advantageous in that
calculating $Z$ involves operations with elementary orthogonal matrices which are
very stable numerically. Also this choice ($Z = Q_2$) gives the best possible bound

$$\kappa(Z^TGZ) \le \kappa(G) \tag{10.1.28}$$

on the condition number $\kappa(Z^TGZ)$. This is not to say that it gives the best $\kappa(Z^TGZ)$
itself, as Gill and Murray (1974a) erroneously claim. Indeed a trivial modification
to Murray's (1971) method above with $D = I$ enables a $Z$ matrix to be calculated
for which $\kappa(Z^TGZ) = 1$. In fact the relevance of the condition number is doubtful
in that it can be changed by symmetric row and column scaling in $Z^TGZ$ without
changing the propagation of errors. Insofar as (10.1.28) does have some meaning it
indicates that $\kappa(Z^TGZ)$ cannot become arbitrarily large. However this conclusion
can also be obtained in regard to other methods if careful attention is given to
pivoting, and I see no reason why these methods could not be implemented in a
reasonably stable way. On the other hand, an arbitrary choice of $V$ which makes
the columns of $Z$ very close to being dependent might be expected to induce
unpleasantly large growth of round-off errors.


10.2 Lagrangian Methods 


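An alternative way of deriving the solution $x^*$ to (10.1.2) and the associated
multipliers $\lambda^*$ is by the method of Lagrange multipliers (9.1.5). The Lagrangian function
(9.1.6) becomes

$$\mathcal{L}(x, \lambda) = \tfrac{1}{2}x^TGx + g^Tx - \lambda^T(A^Tx - b) \tag{10.2.1}$$

and the stationary point condition (9.1.7) yields the equations

$$\begin{aligned}
\nabla_x\mathcal{L} = 0: \quad & Gx + g - A\lambda = 0 \\
\nabla_\lambda\mathcal{L} = 0: \quad & A^Tx - b = 0
\end{aligned}$$

which can be rearranged to give the linear system

$$\begin{bmatrix} G & -A \\ -A^T & 0 \end{bmatrix}
\begin{pmatrix} x \\ \lambda \end{pmatrix} =
-\begin{pmatrix} g \\ b \end{pmatrix}. \tag{10.2.2}$$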


The coefficient matrix is referred to as the Lagrangian matrix and is symmetric but 
not positive definite. If the inverse exists and is expressed as 


$$\begin{bmatrix} G & -A \\ -A^T & 0 \end{bmatrix}^{-1} =
\begin{bmatrix} H & -T \\ -T^T & U \end{bmatrix} \tag{10.2.3}$$

then the solution to (10.2.2) can be written

$$x^* = -Hg + Tb \tag{10.2.4}$$

$$\lambda^* = T^Tg - Ub. \tag{10.2.5}$$

These relationships are used by Fletcher (1971) to solve the equality constraint
problem (10.3.1) which arises in the active set method. Since $b = 0$ in that case
only the matrices $H$ and $T$ need to be stored. Explicit expressions for $H$, $T$, and $U$
when $G^{-1}$ exists are

$$\begin{aligned}
H &= G^{-1} - G^{-1}A(A^TG^{-1}A)^{-1}A^TG^{-1} \\
T &= G^{-1}A(A^TG^{-1}A)^{-1} \\
U &= -(A^TG^{-1}A)^{-1}
\end{aligned} \tag{10.2.6}$$


and are readily verified by multiplying out the Lagrangian matrix and its inverse.
Murtagh and Sargent (1969) suggest methods for linear constraints which use this
representation by storing $(A^TG^{-1}A)^{-1}$ and $G^{-1}$. However it is not necessary for
$G^{-1}$ to exist for the Lagrangian matrix to be non-singular. Neither of the above
methods is recommended in practice since they represent inverses directly and so
have potential stability problems. In fact if $S$ and $Z$ are defined by (10.1.20) then
an alternative representation of the inverse Lagrangian matrix is

$$\begin{aligned}
H &= Z(Z^TGZ)^{-1}Z^T \\
T &= S - Z(Z^TGZ)^{-1}Z^TGS \\
U &= S^TGZ(Z^TGZ)^{-1}Z^TGS - S^TGS.
\end{aligned} \tag{10.2.7}$$

This can be verified by using relationships derived from (10.1.20) (see Question
10.7). In this case it follows that (10.2.4) and (10.2.5) are identical to the
expressions (10.1.15) and (10.1.16) derived in the generalized elimination method.
Thus it can be regarded that the computation of $S$, $Z$, and the $LL^T$ factors of $Z^TGZ$
in any of the elimination methods is essentially a subtle way of factorizing the
Lagrangian matrix in a Lagrangian method. In particular the orthogonal
factorization method is stable and therefore valuable. Equations (10.2.7) prove that the
Lagrangian matrix is non-singular if and only if there exists $Z$ such that the matrix
$Z^TGZ$ is non-singular (see Question 10.11). Furthermore $x^*$ is a unique local
minimizer if and only if $Z^TGZ$ is positive definite by virtue of the second order
conditions (Section 2.1) applied to the quadratic function (10.1.12).
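The Lagrangian system (10.2.2) and the partitioned inverse (10.2.3) can be checked on problem (10.1.6). The following sketch, in exact rational arithmetic with assumed variable names, solves the system directly and then recovers $H$, $T$, and $U$ by inverting the Lagrangian matrix one column at a time. Forming the inverse explicitly is done here only to inspect the blocks; the text advises against it in practice.

```python
from fractions import Fraction as F

def solve(M, rhs):
    """Solve M z = rhs exactly by Gauss-Jordan elimination (M square, non-singular)."""
    n = len(M)
    aug = [[F(v) for v in row] + [F(r)] for row, r in zip(M, rhs)]
    for c in range(n):
        p = next(r for r in range(c, n) if aug[r][c] != 0)
        aug[c], aug[p] = aug[p], aug[c]
        aug[c] = [v / aug[c][c] for v in aug[c]]
        for r in range(n):
            if r != c:
                f = aug[r][c]
                aug[r] = [a - f * t for a, t in zip(aug[r], aug[c])]
    return [row[n] for row in aug]

G = [[2, 0, 0], [0, 2, 0], [0, 0, 2]]
g = [0, 0, 0]
A = [[1, 1], [2, -1], [-1, 1]]
b = [4, -2]
n, m = 3, 2

# Lagrangian matrix [[G, -A], [-A^T, 0]] of (10.2.2)
K = [G[i] + [-a for a in A[i]] for i in range(n)]
K += [[-A[i][j] for i in range(n)] + [0] * m for j in range(m)]

z = solve(K, [-v for v in g + b])          # right-hand side -(g, b)
x, lam = z[:n], z[n:]

# Partitioned inverse (10.2.3): [[H, -T], [-T^T, U]]
cols = [solve(K, [int(i == j) for i in range(n + m)]) for j in range(n + m)]
inv = [list(row) for row in zip(*cols)]
H = [row[:n] for row in inv[:n]]
T = [[-v for v in row[n:]] for row in inv[:n]]
U = [row[n:] for row in inv[n:]]

# (10.2.4) and (10.2.5)
x_from_inverse = [-sum(H[i][j] * g[j] for j in range(n))
                  + sum(T[i][j] * b[j] for j in range(m)) for i in range(n)]
lam_from_inverse = [sum(T[i][j] * g[i] for i in range(n))
                    - sum(U[j][k] * b[k] for k in range(m)) for j in range(m)]

print(x, lam)
```

For this problem $G = 2I$, so $T$ coincides with the matrix $S$ of the orthogonal factorization example and $H$ is one half of the orthogonal projector onto the null space of $A^T$.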

Some other observations on the structure of the inverse in (10.2.3) are the
following. Firstly $T^TA = I$ so $T^T$ is a left generalized inverse for $A$. If $Z^TGZ$ is
positive definite, $H$ is positive semi-definite with rank $n - m$. It also satisfies $HA = 0$
so projects any vector $v$ into the constraint manifold (since $A^THv = 0$). If $x^{(k)}$ is
any feasible point then it follows that $A^Tx^{(k)} = b$, and $g^{(k)} = g + Gx^{(k)}$ is the
gradient vector of $q(x)$ at $x^{(k)}$. Then using (10.2.6) it follows that (10.2.4) and
(10.2.5) can be rearranged as

$$x^* = x^{(k)} - Hg^{(k)} \tag{10.2.8}$$

$$\lambda^* = T^Tg^{(k)}. \tag{10.2.9}$$

Equation (10.2.8) has a close relationship with Newton's method and shows that $H$
contains the correct curvature information for the feasible region and so can be
regarded as a reduced inverse Hessian matrix. Therefore these formulae can be
considered to define a projection method with respect to a metric $G$ (if positive
definite); that is if $G$ is replaced by $I$ then $H = I - AA^+ = P$ which is an orthogonal
projection matrix, and the gradient projection method of Section 11.1 is obtained. In
fact it can be established that $H = (PGP)^+$, the generalized inverse of the projected
Hessian matrix.
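Formulae (10.2.8) and (10.2.9) are easy to exercise on problem (10.1.6), starting from the feasible point $x^{(k)} = (0, 2, 0)^T$. The sketch below is an illustration under stated assumptions: $H$ and $T$ are built from (10.2.7) using the direct elimination $S$ and $Z$, all helper names are invented, and exact rational arithmetic is used so the identities hold exactly. It also checks that $HA = 0$ and that replacing $G$ by $I$ gives $H = I - AA^+$.

```python
from fractions import Fraction as F

def solve(M, rhs):
    """Solve M z = rhs exactly by Gauss-Jordan elimination (M square, non-singular)."""
    n = len(M)
    aug = [[F(v) for v in row] + [F(r)] for row, r in zip(M, rhs)]
    for c in range(n):
        p = next(r for r in range(c, n) if aug[r][c] != 0)
        aug[c], aug[p] = aug[p], aug[c]
        aug[c] = [v / aug[c][c] for v in aug[c]]
        for r in range(n):
            if r != c:
                f = aug[r][c]
                aug[r] = [a - f * t for a, t in zip(aug[r], aug[c])]
    return [row[n] for row in aug]

def transpose(M):
    return [list(col) for col in zip(*M)]

def matmul(P, Q):
    return [[sum(a * t for a, t in zip(row, col)) for col in zip(*Q)] for row in P]

def inverse(M):
    n = len(M)
    return transpose([solve(M, [int(i == j) for i in range(n)]) for j in range(n)])

def reduced_inverse_hessian(G, Z):
    """H = Z (Z^T G Z)^{-1} Z^T, the first formula of (10.2.7)."""
    Zt = transpose(Z)
    return matmul(matmul(Z, inverse(matmul(Zt, matmul(G, Z)))), Zt)

G = [[2, 0, 0], [0, 2, 0], [0, 0, 2]]
g = [[0], [0], [0]]
A = [[1, 1], [2, -1], [-1, 1]]
S = [[F(1, 3), F(2, 3)], [F(1, 3), F(-1, 3)], [0, 0]]
Z = [[F(-1, 3)], [F(2, 3)], [1]]

H = reduced_inverse_hessian(G, Z)
HGS = matmul(H, matmul(G, S))
T = [[s - h for s, h in zip(rs, rh)] for rs, rh in zip(S, HGS)]   # T = S - H G S

x_k = [[0], [2], [0]]                                             # feasible point
g_k = [[gi[0] + v[0]] for gi, v in zip(g, matmul(G, x_k))]        # g + G x_k
x_star = [[xi[0] - v[0]] for xi, v in zip(x_k, matmul(H, g_k))]   # (10.2.8)
lam_star = matmul(transpose(T), g_k)                              # (10.2.9)

I3 = [[int(i == j) for j in range(3)] for i in range(3)]
P = reduced_inverse_hessian(I3, Z)                                # G replaced by I
A_plus = matmul(inverse(matmul(transpose(A), A)), transpose(A))
I_minus_AAplus = [[I3[i][j] - v for j, v in enumerate(row)]
                  for i, row in enumerate(matmul(A, A_plus))]

print(x_star, lam_star)
```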

There are methods for factorizing a general non-singular symmetric matrix
which is not positive definite, initially due to Bunch and Parlett (1971), which can
also be used. This involves calculating $LDL^T$ factors of the Lagrangian matrix
(with some symmetric pivoting) in which $D$ is block diagonal with a mixture of
$1 \times 1$ and $2 \times 2$ blocks. This therefore provides a stable means of solving (10.2.2),
especially for problems in which $G$ is indefinite. However the method does ignore
the zero matrix which exists in (10.2.2) and so does not take full advantage of the
structure. Also since pivoting is involved the factors are not very convenient for
being updated in an active set method. On the other hand, the Lagrangian matrix
is non-singular if and only if $A$ has full rank, so generalized elimination methods
which form $S$ and $Z$ matrices in a reasonably stable way when $A$ has full rank
cannot be unstable as a means of factorizing the Lagrangian matrix as long as $Z^TGZ$ is
positive definite. Thus the stable generalized elimination methods seem to be
preferable.

Yet another way of solving the Lagrangian system (10.2.2) is to factorize the
Lagrangian matrix forward. Firstly $LDL^T$ factors of $G$ are calculated which is stable
so long as $G$ is positive definite. Then these factors are used to eliminate the
off-diagonal partitions ($-A$ and $-A^T$) in the Lagrangian matrix. The $0$ partition then
becomes changed to $-A^TG^{-1}A$ which is negative definite and so can be factorized
as $-\hat{L}\hat{D}\hat{L}^T$ with $\hat{D} > 0$. The resulting factors $\bar{L}\bar{D}\bar{L}^T$ of the Lagrangian matrix are
therefore given by

$$\bar{L} = \begin{bmatrix} L & 0 \\ B^T & \hat{L} \end{bmatrix}, \qquad
\bar{D} = \begin{bmatrix} D & 0 \\ 0 & -\hat{D} \end{bmatrix} \tag{10.2.10}$$

where $B$ is defined by $LDB = -A$ and is readily computed by forward substitution.
To avoid loss of precision in forming $A^TG^{-1}A$ (the 'squaring' effect), $\hat{L}$ and $\hat{D}$ are
best calculated by forming $D^{-1/2}L^{-1}A$ and using the QR method to factorize this
matrix into $Q\hat{D}^{1/2}\hat{L}^T$. This method is most advantageous when $G$ has some
structure (for example a band matrix) which can be exploited and when the number of
constraints $m$ is small (see Question 10.3). It is not entirely obvious that the
method is stable with regard to round-off error, especially when $G$ is nearly
singular, but the method has been used successfully. The requirement that $G$ is positive
definite is important, however, in that otherwise computation of the $LDL^T$ factors
may not be possible or may be unstable. The method is closely related to the dual
transformation described in (9.5.8) and (9.5.9) in the case that the constraints are
equalities.
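The block elimination underlying this forward factorization can be sketched as follows. This is a simplified illustration, not the factorization itself: plain exact linear solves stand in for the $LDL^T$ and QR factors described above, and the function name is an assumption. Eliminating $x$ from (10.2.2) gives $(A^TG^{-1}A)\lambda = b + A^TG^{-1}g$, after which $x = G^{-1}(A\lambda - g)$.

```python
from fractions import Fraction as F

def solve(M, rhs):
    """Solve M z = rhs exactly by Gauss-Jordan elimination (M square, non-singular)."""
    n = len(M)
    aug = [[F(v) for v in row] + [F(r)] for row, r in zip(M, rhs)]
    for c in range(n):
        p = next(r for r in range(c, n) if aug[r][c] != 0)
        aug[c], aug[p] = aug[p], aug[c]
        aug[c] = [v / aug[c][c] for v in aug[c]]
        for r in range(n):
            if r != c:
                f = aug[r][c]
                aug[r] = [a - f * t for a, t in zip(aug[r], aug[c])]
    return [row[n] for row in aug]

def block_elimination_qp(G, g, A, b):
    """Solve (10.2.2) by eliminating x first; G must be non-singular."""
    n, m = len(A), len(A[0])
    A_cols = [[A[i][j] for i in range(n)] for j in range(m)]
    Ginv_A = [solve(G, col) for col in A_cols]           # columns of G^{-1} A
    Ginv_g = solve(G, g)
    schur = [[sum(a * w for a, w in zip(A_cols[i], Ginv_A[j])) for j in range(m)]
             for i in range(m)]                          # A^T G^{-1} A
    rhs = [b[i] + sum(a * w for a, w in zip(A_cols[i], Ginv_g)) for i in range(m)]
    lam = solve(schur, rhs)
    A_lam = [sum(A[i][j] * lam[j] for j in range(m)) for i in range(n)]
    x = solve(G, [al - gi for al, gi in zip(A_lam, g)])  # x = G^{-1}(A lam - g)
    return x, lam

G = [[2, 0, 0], [0, 2, 0], [0, 0, 2]]
A = [[1, 1], [2, -1], [-1, 1]]
x, lam = block_elimination_qp(G, [0, 0, 0], A, [4, -2])
print(x, lam)
```

Only systems in $G$ and in the $m \times m$ matrix $A^TG^{-1}A$ are solved, which is why the approach suits a structured $G$ and a small number of constraints.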


10.3 The Active Set Method 


Most QP problems involve inequality constraints and so can be expressed in the 
form given in (10.1.1). This section describes how methods for solving equality 
constraints can be generalized to handle the inequality problem by means of an
active set method similar to that described for LP problems in Section 8.3. It is
convenient to consider first the case in which the Hessian matrix G is positive 
definite which ensures that any solution is a unique global minimizer, and that some 
potential difficulties are avoided. However it is possible with care to handle the 
more general case which is described in Section 10.4. 

In the active set method certain constraints, indexed by the active set $\mathcal{A}$, are
regarded as equalities whilst the rest are temporarily disregarded, and the method
adjusts this set in order to identify the correct active constraints at the solution to
(10.1.1). Because the objective function is quadratic there may be $m$ ($0 \le m \le n$)
active constraints, in contrast to linear programming when $m = n$. On iteration $k$ a
feasible point $x^{(k)}$ is known which satisfies the active constraints as equalities, that
is $a_i^Tx^{(k)} = b_i$, $i \in \mathcal{A}$. Also except in degenerate cases $a_i^Tx^{(k)} > b_i$, $i \notin \mathcal{A}$, so that
the current active set $\mathcal{A}$ is equivalent to the set of active constraints $\mathcal{A}^{(k)}$ defined
in (7.1.2). Each iteration attempts to locate the solution to an equality problem
(EP) in which only the active constraints occur. This is most conveniently done by
shifting the origin to $x^{(k)}$ and looking for a correction $\delta^{(k)}$ which solves

$$\begin{aligned}
\underset{\delta}{\text{minimize}} \quad & \tfrac{1}{2}\delta^TG\delta + \delta^Tg^{(k)} \\
\text{subject to} \quad & a_i^T\delta = 0, \quad i \in \mathcal{A},
\end{aligned} \tag{10.3.1}$$

where $g^{(k)}$ is defined by $g^{(k)} = g + Gx^{(k)}$ and is $\nabla q(x^{(k)})$ for the function $q(x)$
defined in (10.1.1). This of course is basically in the form of (10.1.2) so can be
solved by any of the methods of Sections 10.1 and 10.2. If $\delta^{(k)}$ is feasible with
regard to the constraints not in $\mathcal{A}$, then the next iterate is taken as $x^{(k+1)} =
x^{(k)} + \delta^{(k)}$. If not then a line search is made in the direction of $\delta^{(k)}$ to find the
best feasible point as described in (8.3.6). This can be expressed by defining the
solution of (10.3.1) as a search direction $s^{(k)}$, and choosing the step $\alpha^{(k)}$ to solve

$$\alpha^{(k)} = \min\left(1, \; \min_{i \notin \mathcal{A},\; a_i^Ts^{(k)} < 0} \frac{b_i - a_i^Tx^{(k)}}{a_i^Ts^{(k)}}\right) \tag{10.3.2}$$

so that $x^{(k+1)} = x^{(k)} + \alpha^{(k)}s^{(k)}$. If $\alpha^{(k)} < 1$ in (10.3.2) then a new constraint
becomes active, defined by the index ($p$ say) which achieves the min in (10.3.2),
and this index is added to the active set $\mathcal{A}$.

If $x^{(k)}$ (that is $\delta = 0$) solves the current EP (10.3.1) then it is possible to
compute multipliers ($\lambda^{(k)}$ say) for the active constraints as described in Sections 10.1
and 10.2. The vectors $x^{(k)}$ and $\lambda^{(k)}$ then satisfy all the first order conditions
(9.1.16) for the inequality constraint problem except possibly that $\lambda_i \ge 0$, $i \in I$.
Thus a test is made to determine whether $\lambda_i^{(k)} \ge 0$ $\forall i \in \mathcal{A} \cap I$: if so then the
first order conditions are satisfied and these are sufficient to ensure a global
solution since $q(x)$ is convex. Otherwise there exists an index, $q$ say, $q \in \mathcal{A} \cap I$,
such that $\lambda_q^{(k)} < 0$. In this case, following the discussion in Section 9.1, it is
possible to reduce $q(x)$ by allowing the constraint $q$ to become inactive. Thus $q$ is
removed from $\mathcal{A}$ and the algorithm continues as before by solving the resulting
EP (10.3.1). If there is more than one index for which $\lambda_q^{(k)} < 0$ then it is usual to
select $q$ to solve

$$\min_{i \in \mathcal{A} \cap I} \lambda_i^{(k)}. \tag{10.3.3}$$

This selection works quite well and is convenient, so is usually used. One slight
disadvantage is that it is not invariant to scaling of the constraints so some attention to
scaling may be required. An invariant but more complex test can be devised based
on the expected reduction in $q(x)$ (Fletcher and Jackson, 1974). To summarize the
algorithm, therefore, if $x^{(1)}$ is a given feasible point and $\mathcal{A} = \mathcal{A}^{(1)}$ is the
corresponding active set then the active set method is defined as follows.


(a) Given $x^{(1)}$ and $\mathcal{A}$, set $k = 1$.

(b) If $\delta = 0$ does not solve (10.3.1), go to (d).

(c) Compute Lagrange multipliers $\lambda^{(k)}$ and solve (10.3.3); if $\lambda_q^{(k)} \ge 0$
terminate with $x^* = x^{(k)}$, otherwise remove $q$ from $\mathcal{A}$.

(d) Solve (10.3.1) for $s^{(k)}$.

(e) Find $\alpha^{(k)}$ to solve (10.3.2) and set $x^{(k+1)} = x^{(k)} + \alpha^{(k)}s^{(k)}$.

(f) If $\alpha^{(k)} < 1$, add $p$ to $\mathcal{A}$.

(g) Set $k = k + 1$ and go to (b).

(10.3.4)
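Steps (a) to (g) can be turned into a short program. The sketch below makes several simplifying assumptions that are not in the text: every constraint is an inequality $a^Tx \ge b$, $G$ is positive definite, each EP is solved from scratch through the Lagrangian system (10.2.2) instead of by updating factors, exact rational arithmetic is used so that the test $\delta = 0$ is exact, and no safeguard against cycling is included. The function names and the small test problem (bounds $x \ge 0$ plus $x_1 + x_2 \le \tfrac{5}{2}$) are made up for illustration.

```python
from fractions import Fraction as F

def solve(M, rhs):
    """Solve M z = rhs exactly by Gauss-Jordan elimination (M square, non-singular)."""
    n = len(M)
    aug = [[F(v) for v in row] + [F(r)] for row, r in zip(M, rhs)]
    for c in range(n):
        p = next(r for r in range(c, n) if aug[r][c] != 0)
        aug[c], aug[p] = aug[p], aug[c]
        aug[c] = [v / aug[c][c] for v in aug[c]]
        for r in range(n):
            if r != c:
                f = aug[r][c]
                aug[r] = [a - f * t for a, t in zip(aug[r], aug[c])]
    return [row[n] for row in aug]

def dot(u, v):
    return sum(a * t for a, t in zip(u, v))

def solve_ep(G, gk, normals):
    """Solve the EP (10.3.1) through the Lagrangian system (10.2.2) with b = 0."""
    n, m = len(G), len(normals)
    K = [list(G[i]) + [-a[i] for a in normals] for i in range(n)]
    K += [[-ai for ai in a] + [0] * m for a in normals]
    z = solve(K, [-v for v in gk] + [0] * m)
    return z[:n], z[n:]

def active_set_qp(G, g, constraints, x, active):
    """Steps (a)-(g) of (10.3.4); constraints is a list of (a, b) with a^T x >= b."""
    x, active = [F(v) for v in x], list(active)
    path = [list(x)]
    while True:
        gk = [gi + dot(row, x) for gi, row in zip(g, G)]
        s, lam = solve_ep(G, gk, [constraints[i][0] for i in active])
        if all(v == 0 for v in s):                     # steps (b) and (c)
            if not active or min(lam) >= 0:
                return x, dict(zip(active, lam)), path
            active.pop(lam.index(min(lam)))            # remove q, then re-solve the EP
            continue
        alpha, p = F(1), None                          # step (e): ratio test (10.3.2)
        for i, (a, b) in enumerate(constraints):
            if i not in active and dot(a, s) < 0:
                ratio = (b - dot(a, x)) / dot(a, s)
                if ratio < alpha:
                    alpha, p = ratio, i
        x = [xi + alpha * si for xi, si in zip(x, s)]
        path.append(list(x))
        if p is not None:                              # step (f)
            active.append(p)

# q(x) = x1^2 + x2^2 - 2*x1 - 4*x2, unconstrained minimizer (1, 2)
G = [[2, 0], [0, 2]]
g = [-2, -4]
constraints = [([1, 0], 0), ([0, 1], 0), ([-1, -1], F(-5, 2))]

x, lam, path = active_set_qp(G, g, constraints, [0, 0], [0, 1])
print(path, x, lam)
```

On this problem the iterates follow the same pattern as the illustration in Figure 10.3.1: both bounds are released in turn, the unconstrained step is cut back at the general constraint, and one further EP solve reaches the solution.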


An illustration of the method for a simple QP problem is shown in Figure 10.3.1. In
this QP the constraints are the bounds $x \ge 0$ and a general constraint $a^Tx \ge b$. At
$x^{(1)}$ the bounds $x_1 \ge 0$ and $x_2 \ge 0$ are both active and so $x^{(1)}$ (that is $\delta = 0$) solves
the current EP (10.3.1). Calculating $\lambda^{(1)}$ shows that the constraint $x_2 \ge 0$ has
negative multiplier and so becomes inactive, so that only $x_1 = 0$ is active. The
corresponding EP is now solved by $x^{(2)}$ (or $\delta = x^{(2)} - x^{(1)}$) which is feasible and so
becomes the next iterate. Calculating $\lambda^{(2)}$ indicates that the constraint $x_1 \ge 0$ has
a negative multiplier so this too becomes inactive leaving $\mathcal{A}$ empty. The
corresponding EP is now solved by $x'$ which is infeasible, so the search direction
$s^{(2)} = x' - x^{(2)}$ is calculated from (10.3.1) and a line search along $s^{(2)}$ by solving
(10.3.2) gives $x^{(3)}$ as the best feasible point. The general constraint $a^Tx \ge b$
becomes active at $x^{(3)}$ so is added to $\mathcal{A}$. Since $x^{(3)}$ does not solve the current EP,
multipliers are not calculated, but instead the EP is solved, yielding $x^{(4)}$ as the next
iterate. Calculation of multipliers at $x^{(4)}$ indicates that $x^{(4)}$ is the solution of the
inequality QP problem, and the algorithm terminates. A numerical example with
similar properties is given in Question 10.4.

Figure 10.3.1 The active set method for a simple QP problem

Termination of the algorithm in general can be proved if each step $\alpha^{(k)} \ne 0$, in
which case $q(x)$ is reduced on each iteration. It is a consequence of (10.3.2) (see
Question 10.10) that the vectors $a_i$, $i \in \mathcal{A}$ are independent so the EP in (10.3.1)
is always well defined. The termination proof relies on the fact that there is a
subsequence $\{x^{(k)}\}$ of iterates which solve the current EP. (Only when $\alpha^{(k)} < 1$ in
(10.3.2) is $x^{(k+1)}$ not a solution to the EP. In this case an index $p$ is added to $\mathcal{A}$.
This can happen at most $n$ times in which case $x^{(k)}$ is then a vertex, so solves the
corresponding EP.) Since the number of possible EPs is finite, since each $x^{(k)}$ in the
subsequence is the unique global solution of an EP, and since $q(x^{(k)})$ is
monotonically decreasing, it follows that termination must occur. This proof fails when
steps of zero length are taken and $q(x^{(k)})$ does not decrease, in which case the
algorithm can cycle by returning to a previous active set in the sequence. This is
caused by degeneracy in the constraint set and the situation is similar to that
described in Section 8.6. There is the additional possibility of a tie occurring when
the value $\alpha^{(k)} = 1$ (for which $s^{(k)}$ solves (10.3.1)) is also the value for which
inactive constraints become active in (8.3.6). It is possible to give perturbation results
which enable these ties to be broken in such a way as to theoretically avoid cycling,
as in Section 8.6, but in practice there are some difficulties (again see Section 8.6)
and most current algorithms ignore the fact that degeneracy can occur.

An important feature of an active set method concerns efficient solution of the
EP (10.3.1) when changes are made to $\mathcal{A}$. As in linear programming it is possible to
avoid refactorizing the Lagrangian matrix on each iteration (requiring $O(n^3)$
housekeeping operations) and instead to update the factors whenever a change is made to
$\mathcal{A}$ ($O(n^2)$ operations). Some care has to be taken in the different methods in
choosing how the various factors are represented to ensure that this updating is
possible and efficient. These computations are quite intricate to write down and
explain, even for just one method, so I shall not give details. Different ideas are
used in different methods; amongst these are simplex-type updates and the idea for
representing QR factors of $A$ without requiring $Q$, described in Section 8.5. Rank one
updates and elementary Givens rotations are often used (Fletcher and
Powell, 1975; Gill and Murray, 1978a). Permutations (interchange or cyclic) in the
constraint or variable orderings are often required to achieve a favourable structure
(for example Fletcher and Jackson, 1974). A review of a number of these
techniques is given by Gill and Murray (1978a). A description of the details in the case
that the orthogonal factorization method in Section 10.1 is used to solve the EP is
given by Gill and Murray (1978b). Another important feature of a similar nature is
that the algorithms should treat bounds (when $a_i = \pm e_i$) in an efficient way. For
instance if the first $p$ ($\le m$) active constraints are lower bounds on the variables
1 to $p$, then the matrices $A$, $S$, and $Z$ have the structure


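$$
A = \begin{bmatrix} I & A_1 \\ 0 & A_2 \end{bmatrix}, \qquad
S = \begin{bmatrix} I & 0 \\ -S_2 A_1^T & S_2 \end{bmatrix}, \qquad
Z = \begin{bmatrix} 0 \\ Z_2 \end{bmatrix},
$$

in which the row blocks have $p$ and $n-p$ rows, the column blocks of $A$ and $S$ have
$p$ and $m-p$ columns, and $Z$ has $n-m$ columns, and
where $S_2^T$ is any left generalized inverse for $A_2$ ($S_2^T A_2 = I$). This structure cuts down
storage and operations requirements but makes the updating more complicated as
there are four cases to consider (adding or removing either a bound or a general
constraint to or from $\mathcal{A}$).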

The active set method requires an initial feasible point $x^{(1)}$. This can be
calculated by the techniques described in Section 8.4, solving the auxiliary problem
(8.4.6). This calculates a feasible vertex (possibly including some pseudoconstraints),
which becomes the required feasible point $x^{(1)}$. Factors of the $A$ matrix for the
active set can also be passed on to be used in the main algorithm. An alternative
possibility which avoids the feasible point requirement is to bias the phase I cost
function by adding to it a multiple $\nu q(x)$ of the quadratic function. For sufficiently
small $\nu$, the problem can then be solved in a single phase. This leads to a study of
QP-like methods for the problem

$$\text{minimize} \quad \phi(x) \triangleq q(x) + \|\ell(x)^+\|_1 \qquad (10.3.5)$$

(for inequality constraints $\ell(x) \le 0$), where $\ell(x) = A^T x - b$ and $a^+ = \max(a, 0)$.

An active set method is also possible which differs from (10.3.4) in a number of
ways. One is that the line search (10.3.2) is changed to search for a minimizer of
(10.3.5): if a constraint $\ell_i(x)$ becomes active (zero) then it is added to $\mathcal{A}$ as before.
Another difference is that the first order conditions for (10.3.5) are that multipliers
$\lambda$ exist which satisfy $0 \le \lambda_i \le 1$ (see (14.3.8) and later in Section 14.3). Thus test
(10.3.3) is changed in an obvious way to choose $\lambda_q$ as that multiplier which violates
these conditions the most. Likewise inactive constraints have their corresponding
multiplier either 0 or 1 according to whether $\ell_i(x) < 0$ or $\ell_i(x) > 0$. The EP which
is solved is

$$
\begin{aligned}
&\underset{x}{\text{minimize}} && q(x) + \sum_{i \notin \mathcal{A}} \lambda_i \ell_i(x) \\
&\text{subject to} && \ell_i(x) = 0, \quad i \in \mathcal{A},
\end{aligned}
\qquad (10.3.6)
$$


and the techniques of Sections 10.1 and 10.2 are again relevant, together with suit- 
able updating methods. 

The relationship between the active set method (10.3.4) and some methods
derived as extensions of linear programming is described in Section 10.6. Possible
variations on (10.3.4) which have been suggested include the removal of more than
one constraint (with $\lambda_i < 0$) from $\mathcal{A}$ in step (c) (Goldfarb, 1972), or the
modification of step (b) so as to accept an approximate solution to the EP. Both these
possibilities tend to induce smaller active sets. Goldfarb (1972) argues that this is
advantageous, although no extensive experience is available and I would expect the
amount of improvement to be small. The modification in which more than one
constraint is removed also creates a potential difficulty for indefinite QP associated
with the need to maintain stable factors of the curvature matrix $Z^T G Z$.
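The basic iteration discussed in this section can be sketched in a few lines. The following is a minimal illustration only, under several assumptions that are not taken from the text: the function names `solve` and `active_set_qp` are hypothetical, the step structure is a simplified reading of an active set iteration (solve the EP, test multipliers, line search, add a constraint), the EP is solved by dense Gauss-Jordan elimination on the Lagrangian system rather than by the factor updates described above, $G$ is assumed positive definite, and no precautions are taken against degeneracy. Exact rational arithmetic is used so that the test "the correction is zero" is unambiguous. The data are a small illustrative two-variable problem.

```python
from fractions import Fraction


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def solve(K, rhs):
    """Solve K y = rhs by Gauss-Jordan elimination in exact arithmetic."""
    n = len(rhs)
    aug = [[Fraction(v) for v in K[i]] + [Fraction(rhs[i])] for i in range(n)]
    for c in range(n):
        p = next(r for r in range(c, n) if aug[r][c] != 0)  # row interchange
        aug[c], aug[p] = aug[p], aug[c]
        piv = aug[c][c]
        aug[c] = [v / piv for v in aug[c]]
        for r in range(n):
            if r != c and aug[r][c] != 0:
                f = aug[r][c]
                aug[r] = [a - f * b for a, b in zip(aug[r], aug[c])]
    return [row[-1] for row in aug]


def active_set_qp(G, g, cons, x, active, max_iter=50):
    """Minimize 0.5 x'Gx + g'x subject to a_i'x >= b_i for (a_i, b_i) in cons.

    x must be feasible and `active` must index constraints active at x.
    """
    n = len(x)
    x = [Fraction(v) for v in x]
    active = list(active)
    for _ in range(max_iter):
        grad = [dot(G[i], x) + g[i] for i in range(n)]
        m = len(active)
        # Lagrangian system: G s - A lam = -grad, A's = 0 (A = active normals).
        K = [list(G[i]) + [-cons[j][0][i] for j in active] for i in range(n)]
        K += [[-c for c in cons[j][0]] + [0] * m for j in active]
        sol = solve(K, [-v for v in grad] + [0] * m)
        s, lam = sol[:n], sol[n:]
        if all(v == 0 for v in s):
            # x solves the current EP: test the multipliers.
            if not lam or min(lam) >= 0:
                return x, dict(zip(active, lam))
            active.pop(lam.index(min(lam)))  # drop the most negative one
            continue
        # Line search: largest step in [0, 1] that keeps x feasible.
        alpha, hit = Fraction(1), None
        for j, (a, b) in enumerate(cons):
            if j not in active and dot(a, s) < 0:
                step = (b - dot(a, x)) / dot(a, s)
                if step < alpha:
                    alpha, hit = step, j
        x = [xi + alpha * si for xi, si in zip(x, s)]
        if hit is not None:
            active.append(hit)
    raise RuntimeError("iteration limit reached")


# Illustrative data: q(x) = x1^2 - x1*x2 + x2^2 - 3*x1,
# with x1 >= 0, x2 >= 0, -x1 - x2 >= -2, started at the vertex x = 0.
G = [[2, -1], [-1, 2]]
g = [-3, 0]
cons = [((1, 0), 0), ((0, 1), 0), ((-1, -1), -2)]
x_star, multipliers = active_set_qp(G, g, cons, x=[0, 0], active=[0, 1])
```

On these data the sketch releases both bounds in turn, meets the third constraint in a line search, and stops at $x = (3/2, 1/2)^T$ with multiplier $1/2$ on that constraint. A practical code would replace `solve` by updated factorizations, which is where most of the effort discussed above lies.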


10.4 Advanced Features 


A modern code for QP should be of wider application than has so far been
considered; it should be able to handle an indefinite QP problem, and it should enable
the user to carry forward efficiently information from a previous QP calculation
when a sequence of problems is being solved. These extensions are considered in
this section. Indefinite QP refers to the case in which the Hessian $G$ is not required
to be positive definite, so will often be indefinite, although the negative and
positive semi-definite cases are also included. For equality constrained QP no
difficulties arise and a unique global solution exists if and only if the reduced Hessian
matrix $Z^T G Z$ is positive definite. However when inequalities are present, and
excluding the positive semi-definite case, it is possible for local solutions to exist
which are not global. The problem of global optimization, here as elsewhere in the
book, is usually very difficult so the algorithms only aim to find a local solution.
In fact a modification of the active set method is used which only guarantees to
find a KT point for which the reduced Hessian $Z^T G Z$ is positive definite. If
$\lambda_i^* > 0$, $i \in I^*$, then these conditions are sufficient for a local solution. However
if there exists $\lambda_i^* = 0$, $i \in I^*$, then it is possible that the method will find a KT
point which is not a local solution. In this case there usually exist arbitrarily small
perturbations of the problem, (a) which make the KT point a local solution or
(b) which make the point no longer a KT point so that the active set method
continues to find a better local solution. Both these features are illustrated in Question
10.9. However although it is theoretically possible for these situations to arise, in
practice (especially with round-off errors) it is unlikely and the algorithm is usually
successful in finding a local (and often global) solution.

The main difficulty in solving indefinite QP problems arises in a different way.
The possibility exists for an arbitrary choice of $\mathcal{A}$ that the resulting reduced
Hessian matrix $Z^T G Z$ may be indefinite so that its $LDL^T$ factors may not exist or
may be unstable in regard to round-off propagation. (Of course if $G$ is positive
definite then so is $Z^T G Z$ and no problems arise.) To some extent the active set method
avoids the difficulty. Initially $x^{(1)}$ is a vertex so the null space (10.1.9) has
dimension zero and the problem does not arise. Also if $Z^T G Z$ is positive definite for some
given $\mathcal{A}$ and a constraint is added to $\mathcal{A}$ then the new null space is a subset of the
previous one, so the new matrix $Z^T G Z$ of one less dimension is also positive definite.

The difficulty arises therefore in step (c) of (10.3.4). In this case $x^{(k)}$ solves the
current EP and $Z^T G Z$ is positive definite. However when a constraint index $q$ is
removed from $\mathcal{A}$, the dimension of the null space is increased and an extra column
is adjoined to $Z$. This causes an extra row and column to be adjoined to $Z^T G Z$, and
it is possible that the resulting matrix is not positive definite. Computing the new
$LDL^T$ factors is just the usual step for extending a Choleski factorization, but when
$Z^T G Z$ is not positive definite the new element of $D$ is either negative or zero. (The
latter case corresponds to a singular Lagrangian matrix.) This implies that the
correction $\delta^{(k)}$ which is a stationary point of (10.3.1) is no longer a minimizer, so some
thought must be given to the choice of the search direction $s^{(k)}$ in step (d) of
(10.3.4). Since $x^{(k)}$ is not a KT point, feasible descent directions exist and any such
direction can be chosen for $s^{(k)}$. One possibility is to consider the line caused by
increasing the slack on constraint $q$ at $x^{(k)}$. Thus $b_q$ is changed to $b_q + \alpha$ and from
(10.2.4) the resulting search direction is $s^{(k)} = T e_q$, the column of $T$ corresponding
to constraint $q$. In fact when the updated $D$ has a negative element this is
equivalent to choosing $s^{(k)} = -\delta^{(k)}$ where $\delta^{(k)}$ is the stationary point of (10.3.1) for the
updated problem. In this case the unit step $\alpha^{(k)} = 1$ in (10.3.2) no longer has
significance. Thus the required modification to (10.3.4) when $D$ has a negative
element is to choose $s^{(k)} = -\delta^{(k)}$ in step (d), and to solve (8.3.6) rather than
(10.3.2) to obtain $\alpha^{(k)}$ in step (e). If $D$ has a zero element it can be replaced by
$-\epsilon$ where $\epsilon > 0$ is negligible. This causes $s^{(k)}$ to be very large but the length of
$s^{(k)}$ is unimportant in solving (8.3.6). It may be that (8.3.6) has no solution; this
is the way in which an unbounded solution of an indefinite QP is indicated.
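The step of extending an $LDL^T$ factorization by one row and column, and the way a non-positive new element of $D$ signals loss of positive definiteness, can be illustrated as follows. This is a sketch under assumptions: the helper name `extend_ldl` and the small matrices are hypothetical, $L$ is stored as a dense unit lower triangular list of rows, and $D$ as a list of its diagonal elements.

```python
def extend_ldl(L, d, h, eta):
    """Border H = L diag(d) L^T with a new column h and corner element eta.

    Returns the new last row of L (excluding its unit diagonal) and the new
    element of D. A result d_new <= 0 means the bordered matrix is not
    positive definite.
    """
    k = len(d)
    y = []
    for i in range(k):  # forward substitution for L y = h
        y.append(h[i] - sum(L[i][j] * y[j] for j in range(i)))
    row = [y[i] / d[i] for i in range(k)]
    d_new = eta - sum(row[i] ** 2 * d[i] for i in range(k))
    return row, d_new


# H = [[4, 2], [2, 3]] has factors L = [[1, 0], [0.5, 1]], D = diag(4, 2).
L = [[1.0, 0.0], [0.5, 1.0]]
d = [4.0, 2.0]

# Bordering with corner 3 keeps the matrix positive definite ...
row_pd, d_pd = extend_ldl(L, d, [2.0, 1.0], 3.0)
# ... whereas bordering with corner 0 gives a negative new element of D.
row_indef, d_indef = extend_ldl(L, d, [2.0, 1.0], 0.0)
```

In the second call the new element is negative, which is the situation in which the text chooses $s^{(k)} = -\delta^{(k)}$ and abandons the unit step.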

It can be seen therefore that in exact arithmetic, with the above changes, there 
is no real difficulty at all. The situation in which D has one negative element is only 
temporary since the algorithm proceeds to add constraints, reducing the dimension 
of the null space, until the solution of an EP is found, in which case the matrix 
$Z^T G Z$ and hence $D$ is positive definite. No further forward extension of the $LDL^T$
factorization is made which might break down. The only adverse possibility is that 
there is no bound on the possible size of the negative element of D. Thus when this 
element is computed the resulting round-off error can be significant. Therefore Gill 
and Murray (1978b) suggest that this possibility is monitored, and if a large negative 
element is observed, then the factors are recomputed at a later stage. 

An alternative possibility which avoids this round-off problem is the following.
When a constraint $q$ is determined as in step (c), then it can be left in as a
pseudoconstraint. The right hand side $b_q$ is allowed to increase (implicitly) to $b_q + \alpha$ and
the search direction $s^{(k)} = T e_q$ described above is followed. If a new constraint $p$
becomes active in the line search, then $p$ is added to $\mathcal{A}$ and the process is repeated.
Only when the solution of an EP is located in a subsequent line search is the index
$q$ removed from $\mathcal{A}$. This approach has the advantage that all the reduced Hessian
matrices $Z^T G Z$ which are computed are positive definite. The disadvantage is that
additional degenerate situations arise when the incoming vector $a_p$ in step (f) of
(10.3.4) is dependent on the vectors $a_i$, $i \in \mathcal{A} \cup q$. This arises most obviously when
moving from one vertex to another, as in linear programming. A possible remedy is
to have formulae for updating the factors of the Lagrangian matrix when an
exchange of constraint indices in $\mathcal{A}$ is made (as in the simplex method for LP). An
idea of this type is used by Fletcher (1971) and should be practicable with more
stable methods but has not, I believe, been investigated.

Another important feature of a QP code is that it should allow information to be
passed forward from a previous QP calculation when the problem has only been
perturbed by a small amount. This is an example of parametric programming (see
Section 8.3 for the LP case). In this case it can be advantageous to solve an initial
EP with the active set $\mathcal{A}$ from the previous QP calculation. The resulting point $x^{(1)}$
will often solve the new QP problem. If $A$ is unchanged then $Z$ need not be
computed in most methods, and if $G$ is also unchanged then $Z^T G Z$ does not change so
its factors need not be recomputed. There are, however, some potential difficulties.
If $A$ changes it may be that it comes close to losing rank, so that round-off
difficulties occur. In indefinite QP problems, if $A$ or $G$ change it may be that $Z^T G Z$ is
no longer positive definite. In both these cases the option of using the previous
active set should be abandoned. Another possibility is that the initial solution $x^{(1)}$
of the EP is infeasible with respect to certain constraints $i \notin \mathcal{A}$. In this case it is
preferable not to have to resort to trying to find a feasible vertex as in Section 8.4.
A better idea is to replace the right hand sides $b_i$ of the violated constraints by
$b_i + \theta$ and to trace the solution of the QP problem as $\theta \downarrow 0$. An alternative approach
is based on using an exact penalty function as described at the end of Section 8.4
and in (10.3.5).

Another aspect of parametric programming is in finding the sensitivity of the
solution with respect to changes in $g$ and $b$. The resulting rate of change of $x^*$ and
$\lambda^*$ can be obtained by substituting in (10.2.4) and (10.2.5), and the effect on
$q(x^*)$ by using the definition of $q(x)$ in (10.1.1). Notice however that $\mathrm{d}q/\mathrm{d}b_i$ is
approximated by $\lambda_i^*$ by virtue of (9.1.10). These estimates may be invalid if there
exist multipliers for which $\lambda_i^* = 0$, $i \in I^*$.
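The interpretation of a multiplier as a rate of change of the optimal objective value can be checked by hand on a very small case. The following worked example is an illustration only; the particular problem is an assumption chosen for simplicity and does not come from the text. It uses the sign convention in which the Lagrangian is $q(x) - \lambda (a^T x - b)$.

```latex
\begin{aligned}
&\text{minimize } q(x) = \tfrac12 (x_1^2 + x_2^2)
  \quad \text{subject to } x_1 + x_2 = b, \\
&\text{stationarity: } x = a \lambda \text{ with } a = (1, 1)^T
  \;\Rightarrow\; x_1 = x_2 = \lambda, \\
&\text{feasibility: } 2\lambda = b
  \;\Rightarrow\; \lambda^* = \tfrac12 b, \quad x^* = (\tfrac12 b, \tfrac12 b)^T, \\
&q(x^*) = \tfrac14 b^2
  \;\Rightarrow\; \frac{\mathrm{d}q(x^*)}{\mathrm{d}b} = \tfrac12 b = \lambda^* .
\end{aligned}
```

Here the relation happens to hold exactly at each value of $b$; in a problem with inequalities it describes the local rate of change only while the active set does not alter.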


10.5 Special QP Problems 


In some cases a QP problem has a special structure which can be utilized in order to
gain improved efficiency or stability. A number of these possibilities are described
in this section. An important case arises when the only constraints in the problem
are the bounds

$$\ell_i \le x_i \le u_i, \qquad i = 1, 2, \ldots, n \qquad (10.5.1)$$

where any $u_i$ or $-\ell_i$ may be $\infty$ and $\ell_i \le u_i$. Thus the vectors $a_i$ are $\pm e_i$ and it is
possible to express the EP (after permuting the variables so that the active bounds
are those on the first $m$ variables) with

$$A = S = \begin{bmatrix} I \\ 0 \end{bmatrix}, \qquad Z = V = \begin{bmatrix} 0 \\ I \end{bmatrix} \qquad (10.5.2)$$

(the row blocks having $m$ and $n-m$ rows)
and $Z^T G Z = G_{22}$ (see the definition before (10.1.3)). Thus no calculations on $Z$ are
required and this can be exploited in a special purpose method. The only updating
which is required is to the $LDL^T$ factors of $Z^T G Z$. Also from (10.1.14) the
multipliers $\lambda^{(k)}$ at the solution to an EP are just the corresponding elements of $g^{(k)}$ by
virtue of the above $S$. Another feature of this problem is that degeneracy in the
constraints cannot arise. Also there are no difficulties in dealing with indefinite QP
using the second method (based on pseudoconstraints) described in Section 10.4.
The only exchange formula which is needed is that for replacing a lower bound on
$x_i$ by its upper bound, or vice versa, which is trivial. Pseudoconstraints (Section 8.4)
are also important for calculating an initial vertex for the problem, especially if
$u_i$ and $-\ell_i$ are large. Methods for this problem are given by Gill and Murray (1976b)
and by Fletcher and Jackson (1974). The latter method updates a partial
factorization of the matrix $\begin{bmatrix} G_{11} & G_{12} \\ G_{21} & G_{22} \end{bmatrix}$; it now seems preferable to me to update
merely the $LDL^T$ factors of $G_{22}$ as mentioned above.

Another special case of QP arises when $G$ is a positive definite diagonal matrix.
By scaling the variables the objective function can be written with $G = I$ as

$$q(x) = \tfrac12 x^T x + g^T x. \qquad (10.5.3)$$

The resulting problem is known as a least distance problem since the solution $x^*$ in
(10.1.1) can be interpreted as the feasible point which is the least distance (in $L_2$)
from the point $x = -g$ (after writing (10.5.3) as $\tfrac12 \|x + g\|_2^2$ and ignoring a constant
term). Again a special purpose method is appropriate since the choice $Z = Q_2$ is
advantageous in that $Z^T G Z = Z^T Z = I$. Thus no special provision need be made for
updating factors of $Z^T G Z$ and only the updating operations on $Z$ are required, for
which the formulation of Gill and Murray (1978b) can be used. Since $G$ is positive
definite, no special difficulties arise.
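The identity behind the least distance interpretation is a one-line completion of the square, written out here for reference; the constant $\tfrac12 g^T g$ is the term that is ignored.

```latex
\tfrac12 \| x + g \|_2^2
  = \tfrac12 (x + g)^T (x + g)
  = \tfrac12 x^T x + g^T x + \tfrac12 g^T g
  = q(x) + \tfrac12 g^T g .
```

Since the two functions differ only by a constant, they have the same minimizer over any feasible region, namely the feasible point closest to $-g$.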

Yet another special case of QP occurs when the problem is to find a best least
squares solution of the linear equations

$$Bx = c \qquad (10.5.4)$$

where $B$ is $p \times n$ and $c \in \mathbb{R}^p$, subject to the constraints in (10.1.1). In this case the
objective function is the sum of squares of residuals defined by

$$q(x) = (Bx - c)^T (Bx - c). \qquad (10.5.5)$$

It is possible to proceed as in the standard method with $G = B^T B$ and $g = -B^T c$,
after dividing by 2 and omitting the constant term. However it is well known that
to form $B^T B$ explicitly in a linear least squares calculation causes the 'squaring
effect' to occur which can cause the loss of twice as many significant figures as are
necessary. It is therefore again advantageous to have a special purpose routine
which avoids the difficulty. This can be done by updating QR factors of the matrix
$BZ$. That is to say,

$$BZ = P \begin{bmatrix} U \\ 0 \end{bmatrix} = P_1 U \qquad (10.5.6)$$

is defined where $P$ is $p \times p$ orthogonal, $U$ is $(n-m) \times (n-m)$ upper triangular, and
$P_1$ is the first $n-m$ columns of $P$. It follows that $U^T U$ is the Choleski factorization
of $Z^T G Z$ ($= Z^T B^T B Z$) and can thus be used to solve (10.1.3) or (10.3.1). An
algorithm is described by Mifflin (1978).
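The squaring effect is easy to exhibit numerically. The sketch below uses a standard small example that is not taken from the text (the matrix and the helper name `gram` are assumptions): the columns of $B$ are linearly independent, yet the computed $B^T B$ is exactly singular in double precision because $1 + \epsilon^2$ rounds to 1.

```python
def gram(B):
    """Return B^T B for a matrix stored as a list of rows."""
    n = len(B[0])
    return [[sum(row[i] * row[j] for row in B) for j in range(n)]
            for i in range(n)]


eps = 1e-8
B = [[1.0, 1.0],
     [eps, 0.0],
     [0.0, eps]]

G = gram(B)
det_G = G[0][0] * G[1][1] - G[0][1] * G[1][0]

# The last two rows of B form eps * I, so B itself has full column rank.
minor_B = B[1][0] * B[2][1] - B[1][1] * B[2][0]
```

With these values `G` is computed as a matrix of ones and `det_G` is zero, although `minor_B` is positive. Working with factors of $BZ$ rather than with $B^T B$ avoids squaring the quantities involved in this way.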

Finally the special case of least squares QP subject only to bounds on the
variables is considered. It is possible to introduce a further special purpose method,
similar to those described above in which, using (10.5.2) and (10.5.6), $B$ is written
$[B_1 : B_2]$ and $P_1 U$ factors of $B_2$ are used to solve the EP. An alternative
possibility is to transform the problem into a least distance problem using the Wolfe
dual (theorem 9.5.1). For example the QP

$$
\begin{aligned}
&\underset{x}{\text{minimize}} && \tfrac12 x^T B^T B x + g^T x \\
&\text{subject to} && x \ge 0
\end{aligned}
\qquad (10.5.7)
$$

becomes

$$
\begin{aligned}
&\underset{x, \lambda}{\text{maximize}} && \tfrac12 x^T B^T B x + g^T x - \lambda^T x && (10.5.8) \\
&\text{subject to} && B^T B x + g - \lambda = 0 && (10.5.9) \\
&&& \lambda \ge 0 && (10.5.10)
\end{aligned}
$$

using (9.5.1). Eliminating $\lambda$ from (10.5.8) using (10.5.9) and writing $Bx = u$ yields
the least distance problem

$$
\begin{aligned}
&\text{minimize} && \tfrac12 u^T u \\
&\text{subject to} && B^T u \ge -g.
\end{aligned}
\qquad (10.5.11)
$$

The system can be solved by the method described above giving a solution $u^*$. The
solution $x^*$ of (10.5.7) can then be recovered in the following way. By applying the
dual transformation (theorem 9.5.1) to (10.5.11) it is easily seen that the dual of
(10.5.11) is (10.5.7) so that the vector $x$ is the multiplier vector for the constraints
in (10.5.11). Thus $x_i = 0$ if $i$ corresponds to an inactive constraint at the solution
of (10.5.11). Furthermore a QR factorization of the columns of $B$ corresponding
to the active constraints is available (see (10.1.17)) so the remaining elements of
$x^*$ can be determined from $Bx^* = u^*$. Thus bounded least squares problems can be
reduced to least distance problems, and vice versa. This result is a special case of the
duality between (9.5.8) and (9.5.9) given in Section 9.5.
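The elimination that turns the Wolfe dual of the bounded least squares problem into a least distance problem is short enough to write out. In the sketch below $\lambda$ denotes the multiplier vector of the bounds $x \ge 0$, so that the dual constraints are $\lambda = B^T B x + g$ and $\lambda \ge 0$, and $u = Bx$.

```latex
\begin{aligned}
\tfrac12 x^T B^T B x + g^T x - \lambda^T x
  &= \tfrac12 x^T B^T B x + g^T x - (B^T B x + g)^T x \\
  &= -\tfrac12 x^T B^T B x \;=\; -\tfrac12 u^T u , \\
\lambda = B^T B x + g \ge 0
  &\;\Longleftrightarrow\; B^T u \ge -g .
\end{aligned}
```

Maximizing $-\tfrac12 u^T u$ is the same as minimizing $\tfrac12 u^T u$, which gives the least distance problem in the variable $u$ with constraints $B^T u \ge -g$.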


10.6 Complementary Pivoting and Other Methods 


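A number of methods for QP have been suggested as extensions of the simplex
method for LP. The earliest of these is probably the Dantzig-Wolfe algorithm
(Dantzig, 1963) which solves the KT conditions for the QP

$$
\begin{aligned}
&\underset{x}{\text{minimize}} && \tfrac12 x^T G x + g^T x \\
&\text{subject to} && A^T x \ge b, \quad x \ge 0.
\end{aligned}
\qquad (10.6.1)
$$

Introducing multipliers $y$ for the constraints $A^T x \ge b$ and $u$ for the bounds $x \ge 0$
gives the Lagrangian function

$$\mathcal{L}(x, y, u) = \tfrac12 x^T G x + g^T x - y^T (A^T x - b) - u^T x. \qquad (10.6.2)$$

Defining slack variables $v = A^T x - b$, the KT conditions (9.1.16) become

$$
\begin{aligned}
u - Gx + Ay &= g \\
v - A^T x &= -b \\
u, v, x, y \ge 0, \quad u^T x &= 0, \quad v^T y = 0.
\end{aligned}
\qquad (10.6.3)
$$

The method assumes that $G$ is positive definite which is a sufficient condition that
any solution to (10.6.3) solves (10.6.1). The system (10.6.3) can be expressed in
the form

$$
\begin{aligned}
w - Mz &= q \\
w \ge 0, \quad z \ge 0, \quad w^T z &= 0
\end{aligned}
\qquad (10.6.4)
$$

where

$$
w = \begin{pmatrix} u \\ v \end{pmatrix}, \quad
z = \begin{pmatrix} x \\ y \end{pmatrix}, \quad
M = \begin{bmatrix} G & -A \\ A^T & 0 \end{bmatrix}, \quad
q = \begin{pmatrix} g \\ -b \end{pmatrix}. \qquad (10.6.5)
$$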
This has come to be referred to as a linear complementarity problem and there are
other problems in game theory and boundary value calculations which can also be
expressed in this way. The Dantzig-Wolfe method for solving (10.6.3) has become
known as the principal pivoting method for solving (10.6.5) which is outlined
below. However all methods for solving (10.6.4) have some features in common.
They carry out row operations on the equations in (10.6.4) in a closely related way
to LP (Section 8.2). Thus the variables $w$ and $z$ together are rearranged into basic
($B$) and nonbasic ($N$) variables $x_B$ and $x_N$. At a general stage, a tableau
representation

$$[\, I : A \,] \begin{pmatrix} x_B \\ x_N \end{pmatrix} = b \qquad (10.6.6)$$

of the equations in (10.6.4) is available and the variables have the values $x_B = b$
and $x_N = 0$. A variable $x_q$, $q \in N$, is chosen to be increased and the effect on $x_B$
from (10.6.6) is given by

$$x_B = b - a_q x_q \qquad (10.6.7)$$

where $a_q$ is the column of $A$ corresponding to variable $x_q$. An element $x_i$, $i \in B$,
becomes zero therefore when $x_q = b_i / a_{iq}$, and this is used in a test similar to
(8.2.11). This determines some variable $x_p$ which has become zero, and so $p$ and $q$
are interchanged between $B$ and $N$ as in LP. The tableau is then rearranged by
making row operations which reduce the old $x_q$ column of the tableau to a unit
vector (the pivot step of LP; see Section 8.2 and Table 8.2.1). The algorithms
terminate when $b \ge 0$ and when the solution is complementary, that is
$z_i \in B \Leftrightarrow w_i \in N$ and $z_i \in N \Leftrightarrow w_i \in B$, for all $i$.

The principal pivoting method is initialized by $x_B = w$, $x_N = z$, $A = -M$, and
$b = q$. This is complementary so the algorithm terminates if $b \ge 0$. If not, there is
an element $w_q \in B$ for which $b_q < 0$ (the most negative is chosen if more than one
exists). The complementary variable $z_q \in N$ is then chosen to be increased. The
effect on the basic variables is observed by virtue of (10.6.7). As long as any $z_i$
variables in $B$ stay non-negative, then $z_q$ is increased until $w_q \uparrow 0$. Then $z_q$ and $w_q$
are interchanged in $B$ and $N$ and the tableau is updated; the resulting tableau is
again complementary so the iteration can be repeated. It may be, however, that
when $z_q$ is increased then a basic variable $z_p \downarrow 0$. In this case a pivot interchange
is made to give a non-complementary tableau. On the next iteration the
complement $w_p$ is chosen to be increased. If this causes $w_q \uparrow 0$ then complementarity
is restored and the algorithm can proceed as described above. However a different
$z_p$ may become zero in which case the same operation of increasing the
complement is repeated. Since there are only a finite number of $z_i$, $i \in B$, ultimately
complementarity is always restored. An example of the method applied to the
problem of Question 10.4 is given in Table 10.6.1. It is not difficult along the
lines of Question 8.17 to show that the principal pivoting (or Dantzig-Wolfe)
method is equivalent to the active set method (10.3.4). A complementary
solution is equivalent to the solution of an EP, and choosing the most negative $w_q$
is analogous to (10.3.3). The case when $w_q \uparrow 0$ is the same as choosing $\alpha^{(k)} = 1$
in (10.3.2), and the case $z_p \downarrow 0$ is equivalent to $\alpha^{(k)} < 1$ when an inactive
constraint becomes active in the line search. These features can all be observed in
Table 10.6.1. Solving the problem in the tableau form has the advantage of
simplicity for a small problem and can give a compact code for microcomputer
applications. Otherwise, however, the tableau form is disadvantageous; it does not
exploit efficient factorizations of the Lagrangian matrix and instead updates
essentially an inverse matrix which can be numerically unstable. Slack variables
for all the inequalities are required so that the tableau can be large. Only a
restricted form (10.6.1) of the QP problem is solved and the method may fail
for indefinite QP. Thus in general a modern code for the active set method is
preferable.

A different algorithm for solving the linear complementarity problem (10.6.4) is
due to Lemke (see the review of Cottle and Dantzig, 1968). In this method the
tableau is extended by adding an extra variable $z_0$ and a corresponding extra column
$-e = -(1, 1, \ldots, 1)^T$ to $A$. Problem (10.6.4) is initialized by $x_B = w$, $x_N = (z_0, z)$,
$A = (-e, -M)$, $b = q$, and the values of the variables are $x_N = 0$ and $x_B = b$. If $b \ge 0$
then this is the solution; otherwise $z_0$ is increased in (10.6.6) until all the variables
$w$ are non-negative. The algorithm now attempts to reduce $z_0$ down to zero whilst
retaining feasibility in (10.6.4). The effect of the initial step is to drive a variable,
$w_p$ say, to zero (corresponding to the most negative $q_i$). Then $w_p$ and $z_0$ are
interchanged in the pivot step and the complement $z_p$ is increased at the start of the
second iteration. In general the variable being increased and the variable which is
driven to zero are interchanged at the pivot step. The next variable to be increased
is then the complement of the one which is driven to zero. The algorithm
terminates when $z_0$ becomes zero, which gives a solution to (10.6.4). An example of the
method applied to the problem in Question 10.4 is given in Table 10.6.2. An
observation about the method is that including the $z_0$ term makes each tableau in the
method a solution of a modified form of problem (10.6.1) in which $g$ is replaced
by $g + z_0 e$ and $b$ by $b - z_0 e$. Thus Lemke's method can be regarded as a parametric
programming solution to (10.6.1) in which the solution is traced out as the
parameter $z_0 \downarrow 0$. A step in which $w_p \downarrow 0$ and then $z_p$ is increased corresponds to an
active constraint becoming inactive as the parameter changes. Vice versa, a step in
which $z_p \downarrow 0$ and then $w_p$ is increased corresponds to an inactive constraint
becoming active. A form of Lemke's algorithm without the $z_0$ variable arises if
there is some column $m_q \ge 0$ in $M$ such that $m_{iq} > 0$ when $q_i < 0$. Then an initial
step to increase $z_q$ is possible which from (10.6.6) makes $w \ge 0$. Then the algorithm
proceeds as above and terminates when complementarity is achieved (see Question
10.13).

Table 10.6.1 Tableaux for the principal pivoting (Dantzig-Wolfe) method

Table 10.6.2 Tableaux for the Lemke pivoting method
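Lemke's method as outlined above is compact enough to sketch in full. The code below is an illustration under stated assumptions rather than a reference implementation: the names `lemke` and `_pivot` are hypothetical, the tableau is held densely as rows $[\,I : -M : -e : q\,]$, exact rational arithmetic is used, ties in the ratio test are broken in favour of $z_0$ leaving, and no other anti-cycling safeguard is included. The example data are the matrix $M = \begin{bmatrix} G & -A \\ A^T & 0 \end{bmatrix}$ and vector $q = (g, -b)$ for a small illustrative QP with $q(x) = x_1^2 - x_1 x_2 + x_2^2 - 3x_1$, $x \ge 0$ and $-x_1 - x_2 \ge -2$.

```python
from fractions import Fraction


def _pivot(T, r, c):
    """Row operations making column c of tableau T the r-th unit vector."""
    p = T[r][c]
    T[r] = [v / p for v in T[r]]
    for i in range(len(T)):
        if i != r and T[i][c] != 0:
            f = T[i][c]
            T[i] = [a - f * b for a, b in zip(T[i], T[r])]


def lemke(M, q, max_iter=100):
    """Find z >= 0 with w = q + M z >= 0 and w'z = 0 (w - M z = q)."""
    n = len(q)
    # Columns: w_1..w_n | z_1..z_n | z_0 | right hand side.
    T = [[Fraction(int(i == j)) for j in range(n)]
         + [-Fraction(M[i][j]) for j in range(n)]
         + [Fraction(-1), Fraction(q[i])] for i in range(n)]
    basis = list(range(n))          # initially the basic variables are w
    if min(q) >= 0:
        return [Fraction(0)] * n
    z0 = 2 * n
    entering = z0
    row = min(range(n), key=lambda i: T[i][-1])   # most negative q_i
    for _ in range(max_iter):
        _pivot(T, row, entering)
        leaving, basis[row] = basis[row], entering
        if leaving == z0:           # z_0 driven to zero: solution found
            z = [Fraction(0)] * n
            for i, var in enumerate(basis):
                if n <= var < 2 * n:
                    z[var - n] = T[i][-1]
            return z
        entering = (leaving + n) % (2 * n)        # complement of leaving
        rows = [i for i in range(n) if T[i][entering] > 0]
        if not rows:
            raise ValueError("ray termination: no solution found")
        row = min(rows, key=lambda i: (T[i][-1] / T[i][entering],
                                       basis[i] != z0))
    raise RuntimeError("iteration limit reached")


# Illustrative data: G = [[2, -1], [-1, 2]], g = (-3, 0), A = (-1, -1)^T, b = -2.
M = [[2, -1, 1],
     [-1, 2, 1],
     [-1, -1, 0]]
q = [-3, 0, 2]
z = lemke(M, q)
w = [q[i] + sum(M[i][j] * z[j] for j in range(3)) for i in range(3)]
```

For these data the sketch takes four pivots and returns $z = (3/2, 1/2, 1/2)^T$, that is $x = (3/2, 1/2)^T$ with multiplier $1/2$ on the general constraint, and $w = 0$.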

Another popular early technique was Beale's method (Beale, 1959) which was
developed as an extension of linear programming. However the method can also be
described briefly as an active set method (Goldfarb, private communication) in the
following way. Consider the active set method for LP in Section 8.3 applied to
minimize a quadratic function. If the $k$th line search is terminated by reaching an
unconstrained minimum, then the difference in gradients $\gamma^{(k)} = g^{(k+1)} - g^{(k)}$ is
exchanged with the outgoing constraint normal in the matrix $A$ in (8.3.3). After a
sequence of steps in which constraints are removed from $\mathcal{A}$ the effect is that the
search direction $s^{(k+1)}$ is orthogonal to the stored vectors $\gamma^{(j)}$, up to and including
$\gamma^{(k)}$, and hence conjugate to the corresponding directions $s^{(j)}$. Thus the possibility of
quadratic termination is introduced (see Volume 1, Section 2.5). Unfortunately if a
previously inactive constraint
then becomes active, the sequence of conjugate directions is broken. To recover
from this Beale's method requires some unproductive iterations in which the now
irrelevant $\gamma$ vectors are removed from the $A$ matrix. The possibility of
rearranging the tableau in order to avoid these iterations has been studied by Benveniste
(1979). The method is then likely to be closely related to Murray's (1971) method
which also recurs conjugate directions. Since essentially an inverse matrix is updated,
the more stable factorizations described in Section 10.1 are preferred. The
relationship between the active set method, the Dantzig-Wolfe method, Beale's method,
and others is discussed more extensively by Goldfarb (1972).


Questions for Chapter 10 
1. Consider the equality quadratic programming problem 
Bia prey 1 £0 ine 
minimize qg(x)44x™|-1 2 —-1|x+]1] x 
0 —-1 1 \1 
subject to $x_1 + 2x_2 + x_3 = 4$.
Eliminate $x_1$ and express the resulting function in the form $\tfrac12 y^T U y + v^T y$,
where $y = (x_2, x_3)^T$, $U$ is a constant symmetric matrix, and $v$ is a constant vector.
Hence find the solution $x^*$ to the QP. Find the Lagrange multiplier $\lambda^*$ of the
equality constraint. Does $x^*$ solve the QP

minimize $q(x)$

subject to $x_1 + 2x_2 + x_3 \ge 4$, $x \ge 0$?


2. Solve the equality constraint QP problem 


minimize 5? — 6,5, +53 — 653 +53 + 25, — 5 
8 
subject to 35; — 62 +63 =0 
Wan 55-6, = 0 
by eliminating the variables $\delta_1$ and $\delta_2$. Exhibit the matrices $A$, $A_1$, $A_2$ and $Z$,
and a matrix $S$ such that $S^T A = I$. Hence find the multipliers of the two
constraints at the solution.

3. Solve the equality constraint QP 


n 
minimize 2 ix? 
i=1 


n 
subject to 2 x;=K (K > 0) 
i=1 
by the method of Lagrange multipliers. Does this solution minimize the
function subject to the constraints


n 
x20 f= leon ane 2 eK? 
l= 


Give the explicit solution of the equality constraint problem when n = 3, 
K = 10. 

4. Starting from $x^{(1)} = 0$ illustrate the steps taken by the active set method to
solve the QP problem

$$
\begin{aligned}
&\underset{x}{\text{minimize}} && x_1^2 - x_1 x_2 + x_2^2 - 3x_1 \\
&\text{subject to} && x_1 \ge 0 \\
&&& x_2 \ge 0 \\
&&& -x_1 - x_2 \ge -2.
\end{aligned}
$$

For simplicity, solve each EP without making a shift of variables. Verify that
the method is equivalent to the Dantzig-Wolfe method (Table 10.6.1).

5. Illustrate the matrices V, S, and Z when solving the QP problem in Question 
9.6 by (a) direct elimination, (b) orthogonal factorization (Gill and Murray, 
1974a). 

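6. Given matrices $A$ and $Z$ which are $n \times m$ and $n \times (n-m)$ respectively
($1 \le m < n$) consider how to choose an $n \times (n-m)$ matrix $V$ such that
(10.1.20) holds. Let the matrices be partitioned after the $m$th row as

$$
A = \begin{bmatrix} A_1 \\ A_2 \end{bmatrix}, \quad
V = \begin{bmatrix} V_1 \\ V_2 \end{bmatrix}, \quad
S = \begin{bmatrix} S_1 \\ S_2 \end{bmatrix}, \quad
Z = \begin{bmatrix} Z_1 \\ Z_2 \end{bmatrix}.
$$

Show that $S_1 = A_1^{-T}(I - A_2^T S_2)$, $V_1 = -A_1 S_2^T Z_2^{-T}$ and $V_2 = Z_2^{-T}(I - Z_1^T V_1)$
define $S_1$, $V_1$, and hence $V_2$, in terms of $S_2$. It follows that $V$ is
underdetermined to the extent of an arbitrary choice of the $(n-m) \times m$ matrix $S_2$.
Investigate the choice $S_2 = 0$.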

7. Let matrices $G$, $A$, $S$, and $Z$ be $n \times n$, $n \times m$, $n \times m$, and $n \times (n-m)$
respectively ($m < n$), let $S^T A = I$, $Z^T A = 0$, and let the matrix $[S : Z]$ be
non-singular. Show that there exists a unique $n \times (n-m)$ matrix $V$ such that
$[A : V]^{-1} = [S : Z]^T$, and show that $S^T V = 0$, $Z^T V = I$, and $A S^T + V Z^T = I$.
Hence show that (10.2.3) holds where $H$, $T$, and $U$ are defined by (10.2.7).
Show how the problem

$$
\begin{aligned}
&\underset{x}{\text{minimize}} && \tfrac12 x^T G x + g^T x \\
&\text{subject to} && A^T x = b
\end{aligned}
$$

can be solved by the method of Lagrange multipliers, and use (10.2.3) and
(10.2.7) to obtain explicit expressions for the solution $x^*$ and the associated
Lagrange multiplier $\lambda^*$. Verify that these reduce to (10.1.15) and (10.1.16)
respectively.


8. Establish equations (10.2.8) and (10.2.9) and show that the matrix $H$ in
(10.2.3) satisfies $H = (PGP)^+$ where $P = I - AA^+$.


9. Consider the indefinite QP problem


minimize —x,X 
x 
subject to22>x,+xz2> O 
22x, Nini — 2 


and show that if the active set method starts at the vertex $x^{(1)} = (1, 1)^T$ then
the method terminates at a point $x'$ which is not a local solution to the
problem. Show that there exist arbitrarily small perturbations to the constraints such
that (a) $x'$ is a local solution, (b) the algorithm does not terminate at $x'$ but
finds the global solution.

10. In the active set method assume that the initial active set $\mathcal{A}^{(1)}$ has the property
that the set of vectors $a_i$, $i \in \mathcal{A}^{(1)}$, are independent. Show using equation
(10.3.2) that any vector $a_p$ which is added to the set is not dependent on the
other vectors in the set, and hence prove inductively that the independence
condition is retained.

11. If the Lagrangian matrix in (10.2.2) is non-singular, show that $A$ has full rank
and hence that a matrix $Z$ can be chosen as in (10.1.18) with orthogonal
columns. If $Z^T G Z$ is singular, consider $G + \nu I$ as $\nu \to 0$ and show that (10.2.7)
becomes unbounded whereas (10.2.3) does not. This is a contradiction and
proves that $Z^T G Z$ is non-singular.

12. Show that the least distance to a fixed point $p$ of any point $x$ which satisfies
the constraints $A^T x \ge b$ can be found by solving a certain QP problem. If all
the constraints are active, show by the method of Lagrange multipliers that
the problem reduces to the solution of a system of linear equations. In general,
when there may be inactive constraints, use the Wolfe dual (theorem 9.5.1) to
show that the optimum multipliers satisfy a QP problem with simple bounds
only. How does the solution of this problem determine the least distance
solution?

13. State the linear complementarity problem arising from Question 10.4 and
show that it satisfies the conditions for the modified form of Lemke's method
(Section 10.6) to be applicable, without introducing the variable $z_0$. Hence
solve the problem in this way.
14. Show that the problem of finding the point of least Euclidean distance from
the origin (in IR*) subject to 


can be formulated as a quadratic programming problem. By eliminating $x_1$ and
$x_2$, calculate a solution to the equality problem in which both constraints are
active. Is this the solution to the least distance quadratic programming 
problem? 





Chapter 11 


General Linearly Constrained Optimization 


11.1 Equality Constraints 


The next class of problem which it is convenient to consider is that in which the 
objective function is general but the constraints remain linear, that is 


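$$
\begin{aligned}
&\underset{x}{\text{minimize}} && f(x) \\
&\text{subject to} && a_i^T x = b_i, \quad i \in E \\
& && a_i^T x \ge b_i, \quad i \in I.
\end{aligned}
\qquad (11.1.1)
$$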


The methods for handling the linear constraints are largely those used in quadratic 
programming, but the non-quadratic objective function introduces an extra level of 
difficulty. One aspect of this is that in general the problem can no longer be solved 
finitely so the solution $x^*$ is obtained in the limit of some iterative sequence $\{x^{(k)}\}$.
The equality constraint problem ($I = \emptyset$) is considered in this section and is readily
handled by generalized elimination, so that the main features are essentially those 
relating to unconstrained optimization as described in Volume 1. Methods for 
calculating Lagrange multipliers at the solution of an equality constraint problem 
are also considered at the end of this section. Inequality constraints can be handled 
by an active set method (Section 11.2) which is a generalization of that used in 
quadratic programming. A new decision which occurs is how accurately to solve 
each equality problem which arises: a poor decision in this respect can lead to the 
phenomenon of zigzagging which can slow down the rate of convergence appreci- 
ably. Methods for overcoming this problem are described in Section 11.3. In 
addition, some alternative possibilities for handling inequality constraints by way 
of a reduction to a sequence of quadratic programming problems are described at 
the end of Section 11.2. 
The rest of this section is devoted to finding the solution $x^*$ of the equality
constraint problem

$$
\begin{aligned}
&\underset{x}{\text{minimize}} && f(x) \\
&\text{subject to} && A^T x = b
\end{aligned}
\qquad (11.1.2)
$$

in which $A$ is $n \times m$ and $b \in \mathbb{R}^m$. As in Section 10.1 it is assumed that
$\text{rank}(A) = m$. The generalized elimination method of Section 10.1 is used to
provide a reduction to an unconstrained problem and includes both the elimination
of variables method and the orthogonal factorization method, amongst others. Thus
matrices $S$ and $Z$ are introduced such that $S^T A = I$, $Z^T A = 0$, and $[S : Z]$ is
non-singular. If the current iterate is the feasible point $x^{(k)}$ then a general
feasible point can be expressed as


$$ x = x^{(k)} + \delta = x^{(k)} + Zy \qquad (11.1.3) $$

where $\delta = Zy$ is a feasible correction in the null space of $A$ (see (10.1.9) and
(10.1.10)). Thus an equivalent form of (11.1.2) is to solve the reduced
unconstrained problem

$$ \underset{y}{\text{minimize}}\ \psi(y) \triangleq f(x^{(k)} + Zy). \qquad (11.1.4) $$

For computational convenience and stability it is usually best to define a new
reduced function on each iteration as (11.1.4) indicates. By using the chain rule in
(11.1.3) it follows that

$$ \nabla_y = Z^T \nabla_x \qquad (11.1.5) $$

and so derivatives of (11.1.4) are given by

$$ \nabla_y \psi(y) = Z^T g(x) \qquad (11.1.6) $$

which is the reduced gradient vector, and by

$$ \nabla_y^2 \psi(y) = Z^T [G(x)] Z \qquad (11.1.7) $$

which is the reduced Hessian matrix.

It is possible to apply any appropriate technique for unconstrained optimization
described in Volume 1 to solve the reduced problem (11.1.4) and most of the
remarks in Chapter 2 about the structure of methods apply here also. From (11.1.6)
and (11.1.7), sufficient conditions for $x^*$ to be optimal are that $Z^T g^* = 0$ and
$Z^T G^* Z$ is positive definite, and necessary conditions are that $Z^T g^* = 0$ and
$Z^T G^* Z$ is positive semi-definite. These conditions are readily shown to be
equivalent to those described in Sections 9.1 and 9.3 (see Question 11.1). The
initial feasible point $x^{(1)}$ can be $x^{(1)} = Sb$, which is the closest feasible point to
the origin in the orthogonal factorization method, or more generally
$x^{(1)} = x' + S(b - A^T x')$ which is the closest feasible point to any given point $x'$.
As in Chapter 3 it is important to start by considering Newton's method which is
based on the quadratic model obtained by truncating the Taylor series for $\psi(y)$
about $y = 0$, that is

$$ \psi(y) \approx q^{(k)}(y) = f^{(k)} + y^T Z^T g^{(k)} + \tfrac{1}{2} y^T Z^T G^{(k)} Z y. \qquad (11.1.8) $$

The origin in y-space corresponds to the point $x^{(k)}$ in x-space (see (11.1.3)) and so
$f^{(k)}$, $g^{(k)}$ and $G^{(k)}$ refer to quantities evaluated at $x^{(k)}$. The model quadratic
$q^{(k)}(y)$ refers to the function obtained on iteration $k$. Thus the basic Newton's
method chooses $y^{(k)}$ to minimize $q^{(k)}(y)$. A unique minimizer exists if and only
if $Z^T G^{(k)} Z$ is positive definite and is obtained by making $\nabla q^{(k)} = 0$, that is by
solving
$$ (Z^T G^{(k)} Z)\, y = -Z^T g^{(k)} \qquad (11.1.9) $$

for $y = y^{(k)}$, in which case the next iterate is $x^{(k+1)} = x^{(k)} + Zy^{(k)}$ from (11.1.3).
If any $x^{(k)}$ is sufficiently close to $x^*$ then the method converges and the rate is
second order. However the method can fail to converge, and moreover is undefined
when $Z^T G^{(k)} Z$ is not positive definite. Changes to correct these disadvantages
include the use of a line search and the possibility of modifying the Hessian matrix
to become positive definite. The latter includes the ideas of restricted step and
Levenberg-Marquardt methods. All of this development is given in Volume 1,
Section 3.1 and Chapter 5, and the reader is referred there for further details. One
point of interest is that a step restriction $\|y\|_2 \le h^{(k)}$ can be handled by a
modification $Z^T G^{(k)} Z + \nu I$ for some $\nu \ge 0$ as in Section 5.2 and corresponds to a
restriction $\|x - x^{(k)}\|_2 \le h^{(k)}$ if the orthogonal factorization method is used to
reduce the problem. However for other generalized elimination methods this
correspondence does not hold.
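The basic iteration (11.1.3), (11.1.6), (11.1.7) and (11.1.9) can be sketched in a few lines of NumPy. This is only an illustrative sketch: it assumes the orthogonal factorization choice of $S$ and $Z$ (obtained here from a QR factorization), it takes plain Newton steps without the line search or Hessian modification discussed above, and the test function and the names `generalized_elimination` and `reduced_newton` are hypothetical.

```python
import numpy as np

def generalized_elimination(A):
    """Return S, Z with S^T A = I and Z^T A = 0 (orthogonal factorization)."""
    n, m = A.shape
    Q, R = np.linalg.qr(A, mode="complete")    # A = Q[:, :m] @ R[:m, :]
    S = Q[:, :m] @ np.linalg.inv(R[:m, :]).T   # S^T = R1^{-1} Q1^T = A^+
    Z = Q[:, m:]                               # orthonormal null-space basis
    return S, Z

def reduced_newton(grad, hess, A, b, max_iter=30, tol=1e-10):
    """Plain Newton iteration on the reduced problem (no line search)."""
    S, Z = generalized_elimination(A)
    x = S @ b                                  # feasible start x^(1) = S b
    for _ in range(max_iter):
        rg = Z.T @ grad(x)                     # reduced gradient (11.1.6)
        if np.linalg.norm(rg) <= tol:
            break
        RH = Z.T @ hess(x) @ Z                 # reduced Hessian (11.1.7)
        y = np.linalg.solve(RH, -rg)           # Newton system (11.1.9)
        x = x + Z @ y                          # feasible update (11.1.3)
    return x, Z

# Hypothetical test problem: f(x) = sum(exp(x_i)) + 0.5*||x||^2
# subject to x1 + 2*x2 + 3*x3 = 1.
grad = lambda x: np.exp(x) + x
hess = lambda x: np.diag(np.exp(x) + 1.0)
A = np.array([[1.0], [2.0], [3.0]])
b = np.array([1.0])
x_star, Z = reduced_newton(grad, hess, A, b)
```

Every iterate stays feasible because the correction $Zy$ lies in the null space, so only the reduced gradient needs to be driven to zero.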

It is often the case that the user is unable or does not wish to supply formulae
for second derivatives so it is important to consider methods which require only
first derivative information or even no derivatives at all. When first derivatives are
available, one possibility is a finite difference Newton method (see Section 3.2).
In this case the differences are best taken in y-space so that the ith column of the
reduced Hessian matrix is defined by

$$ Z^T \left( g(x^{(k)} + z_i h) - g^{(k)} \right) / h \qquad (11.1.10) $$

for some small $h$, and after symmetrization this matrix is used to replace $Z^T G^{(k)} Z$
in (11.1.9). Only $n - m$ additional gradient evaluations are required per iteration so
the method might be useful when $n - m$ is small, especially as a way to estimate an
initial reduced Hessian approximation.
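A minimal sketch of the differencing scheme (11.1.10) follows, assuming a user-supplied gradient routine. The function name `fd_reduced_hessian`, the step `h` and the quadratic test function are illustrative assumptions.

```python
import numpy as np

def fd_reduced_hessian(grad, x, Z, h=1e-6):
    """Estimate Z^T G Z column by column from gradient differences (11.1.10)."""
    g0 = grad(x)
    cols = [Z.T @ (grad(x + h * Z[:, i]) - g0) / h for i in range(Z.shape[1])]
    M = np.column_stack(cols)
    return 0.5 * (M + M.T)                     # symmetrize before use in (11.1.9)

# Hypothetical quadratic f(x) = 0.5 x^T G x, whose reduced Hessian is Z^T G Z.
G = np.array([[4.0, 1.0, 0.0], [1.0, 3.0, 0.5], [0.0, 0.5, 2.0]])
grad = lambda x: G @ x
A = np.array([[1.0], [1.0], [1.0]])
Z = np.linalg.qr(A, mode="complete")[0][:, 1:]
H_fd = fd_reduced_hessian(grad, np.array([0.2, 0.3, 0.5]), Z)
```

The loop makes exactly $n - m$ extra gradient calls, one per column of $Z$.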

However for the reasons described in Section 3.2 it is usually preferable to select
a quasi-Newton method to solve the reduced problem (11.1.4). Various suggestions
(see below) have been made but the most satisfactory in view of the development
given here is due to Gill and Murray (1974c). The reduced Hessian matrix is
approximated by a positive definite matrix $B^{(k)}$ given in factored form

$$ Z^T G^{(k)} Z \approx B^{(k)} = L^{(k)} D^{(k)} L^{(k)T}. \qquad (11.1.11) $$

The search direction in y-space is then defined by solving

$$ B^{(k)} p = -Z^T g^{(k)} \qquad (11.1.12) $$

for $p = p^{(k)}$, by analogy with (11.1.9), and $f(x)$ is minimized approximately by a
search along the line $x^{(k)} + \alpha s^{(k)}$ where $s^{(k)} = Zp^{(k)}$. The conditions for
terminating the line search are those described in Sections 2.4 and 2.6. More in
keeping with the 'old-fashioned' development of Section 3.2 is to represent $B^{(k)}$ by
the inverse reduced Hessian approximation

$$ H^{(k)} = [B^{(k)}]^{-1} \approx (Z^T G^{(k)} Z)^{-1} \qquad (11.1.13) $$

and to compute $s^{(k)}$ from

$$ s^{(k)} = -Z H^{(k)} Z^T g^{(k)}. \qquad (11.1.14) $$
In view of the remarks at the end of Section 3.2 it may be that (11.1.13) is no less
stable than (11.1.11) whilst being somewhat more convenient to use. After each
iteration $H^{(k)}$ (or the factorization of $B^{(k)}$) is updated to include the additional
curvature information obtained in the line search. For the reasons described in
Chapter 3 the BFGS formula is currently preferred. The required quantities $\gamma^{(k)}$
and $\delta^{(k)}$ (in (3.2.12) say) become differences in reduced gradients and variables,
that is $\gamma^{(k)} = Z^T (g^{(k+1)} - g^{(k)})$ and $\delta^{(k)} = y^{(k)}$. The initial choice $H^{(1)}$ can be any
positive definite matrix; $H^{(1)} = I$ is usually chosen in the absence of any other
information. The properties described in Section 3.2 relating amongst others to the
maintaining of positive definite matrices $H^{(k)}$ and to quadratic termination (here in
$n - m$ iterations) also hold good.
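The quadratic termination property can be observed numerically with a small sketch. The code below assumes a strictly convex quadratic objective so that the line search can be done exactly in closed form, uses the BFGS update of the inverse approximation with reduced differences, and starts from $H^{(1)} = I$. The function names and the test data are hypothetical.

```python
import numpy as np

def bfgs_update(H, delta, gamma):
    """BFGS update of the inverse approximation H from differences delta, gamma."""
    dg = delta @ gamma
    Hg = H @ gamma
    return (H + (1.0 + gamma @ Hg / dg) * np.outer(delta, delta) / dg
            - (np.outer(delta, Hg) + np.outer(Hg, delta)) / dg)

def reduced_bfgs(G, c, Z, x, tol=1e-9):
    """Minimize 0.5 x^T G x + c^T x over x + Zy using exact line searches."""
    H = np.eye(Z.shape[1])                     # H^(1) = I
    g = G @ x + c
    iters = 0
    while np.linalg.norm(Z.T @ g) > tol and iters < 50:
        p = -H @ (Z.T @ g)                     # reduced search direction
        s = Z @ p                              # s^(k), as in (11.1.14)
        alpha = -(s @ g) / (s @ G @ s)         # exact minimizer (quadratic f only)
        x = x + alpha * s
        g_new = G @ x + c
        gamma = Z.T @ (g_new - g)              # difference in reduced gradients
        delta = alpha * p                      # difference in reduced variables
        H = bfgs_update(H, delta, gamma)
        g = g_new
        iters += 1
    return x, iters

# Hypothetical data: n = 4 variables, m = 1 constraint sum(x) = 1.
G = np.diag([1.0, 2.0, 3.0, 4.0]) + 0.5 * np.ones((4, 4))
c = np.array([1.0, -2.0, 3.0, -4.0])
A = np.ones((4, 1))
b = np.array([1.0])
Z = np.linalg.qr(A, mode="complete")[0][:, 1:]
x, iters = reduced_bfgs(G, c, Z, np.full(4, 0.25))
```

With $n - m = 3$ free directions the loop should stop after at most three line searches, up to round-off.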

Another possibility for a suitable algorithm is to apply a conjugate gradient
method (Section 4.1) to the reduced problem. This is likely to be preferable only
for large problems in which the matrix $H^{(k)}$ cannot conveniently be stored. In this
case it is also important to give some thought to storage for the matrices $S$ and $Z$.
The best possibility is to define these implicitly by factorizing the matrix $[A : V]$
in such a way as to retain sparsity as much as possible. Likewise a choice of $V$ which
retains sparsity should be made, for example that given in (10.1.21) corresponding
to elimination of variables. For no-derivative problems the experience of Section
3.2 again suggests using a quasi-Newton method with difference approximations to
derivatives. These are best calculated in the space of the reduced variables so that
the ith element of $Z^T g^{(k)}$ is estimated by

$$ \left( f(x^{(k)} + z_i h) - f^{(k)} \right) / h \qquad (11.1.15) $$

for some small interval $h$, which requires only $n - m$ additional function
evaluations. Near the solution it is preferable to use the corresponding central
difference formula

$$ \left( f(x^{(k)} + z_i h) - f(x^{(k)} - z_i h) \right) / 2h. \qquad (11.1.16) $$
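The two difference formulae can be compared on a small example. In this sketch $Z$ comes from elimination of variables for the single constraint $x_1 + x_2 + x_3 = 1$; the objective, the step `h` and the function name `reduced_gradient_fd` are illustrative assumptions.

```python
import numpy as np

def reduced_gradient_fd(f, x, Z, h=1e-6, central=False):
    """Estimate Z^T g by differences along the columns z_i of Z."""
    fx = f(x)                                  # only used by the forward formula
    out = np.empty(Z.shape[1])
    for i in range(Z.shape[1]):
        z = Z[:, i]
        if central:
            out[i] = (f(x + h * z) - f(x - h * z)) / (2.0 * h)   # (11.1.16)
        else:
            out[i] = (f(x + h * z) - fx) / h                     # (11.1.15)
    return out

# Hypothetical objective and its exact gradient, used only for comparison.
f = lambda x: np.exp(x[0]) + x[1] ** 2 + x[0] * x[2]
g = lambda x: np.array([np.exp(x[0]) + x[2], 2.0 * x[1], x[0]])
Z = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])   # columns satisfy A^T z = 0
x = np.array([0.2, 0.3, 0.5])
rg_forward = reduced_gradient_fd(f, x, Z)
rg_central = reduced_gradient_fd(f, x, Z, central=True)
```

The central formula costs twice as many function evaluations but has a truncation error of order $h^2$ rather than $h$, which is why it is preferred near the solution where the reduced gradient is small.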


Conjugate direction set methods as in Section 4.2 are also possible (Buckley, 1975)
but there is no evidence to suggest that they are preferable. Least squares problems
($f(x) \triangleq r(x)^T r(x)$) with linear constraints can be handled readily: the Jacobian
matrix in the reduced coordinates is $\nabla_y (r^{(k)T}) = Z^T A^{(k)}$ where $A = \nabla_x r^T$ (here $A$
denotes the Jacobian matrix of the residuals rather than the constraint matrix), and
analogues of the Gauss-Newton method, etc., are obtained by using the estimate
$Z^T G^{(k)} Z \approx 2 Z^T A^{(k)} A^{(k)T} Z$ in (11.1.9); the resulting properties are the same as
those described in Section 6.1.

Historically the earliest methods for solving (11.1.2) used steepest descent
searches as typified by Rosen's (1960) gradient projection method. This is
equivalent to choosing a search direction $p^{(k)} = -\nabla_y \psi = -Z^T g^{(k)}$ as the steepest
descent vector in the reduced coordinates and hence $s^{(k)} = -ZZ^T g^{(k)}$ as the search
direction in x-space. When $Z$ is defined by the orthogonal factorization method it
follows that $s^{(k)} = -Pg^{(k)}$ where $P = ZZ^T = I - AA^+$ is a projection matrix which
projects into the null space of $A$. Rosen suggests calculating this matrix although it
is preferable to use the implicit definition $P = ZZ^T$. The idea for using an active set
method also derives from Rosen (1960). Quasi-Newton methods for linearly
constrained problems were first considered by Goldfarb (1969) although he updates
the $n \times n$ matrix

$$ \bar{H}^{(k)} = Z H^{(k)} Z^T \qquad (11.1.17) $$

which is positive semi-definite with rank $n - m$ and is related to the $H$ matrix in
(10.2.3). It is easy to show that updating $H^{(k)}$ by any formula in the Broyden
family, using reduced differences as above, is equivalent to updating $\bar{H}^{(k)}$ by the
same formula with unreduced differences (Question 11.2). The choice $H^{(1)} = I$ is
equivalent to choosing $\bar{H}^{(1)} = ZZ^T$ which is the projection matrix $P$ when the
orthogonal factorization method defines $Z$. However computing the search
direction using $s^{(k)} = -\bar{H}^{(k)} g^{(k)}$ causes problems due to round-off errors in that $s^{(k)}$ no
longer remains in the null space of $A$ when $x^{(k)}$ is close to $x^*$, and so the implicit
representation through (11.1.14) is preferable. Another early quasi-Newton method
is due to Murtagh and Sargent (1969), and updates estimates of $G^{-1}$ and
$(A^T G^{-1} A)^{-1}$ using the rank one formula, and uses the representation (10.2.6) to
solve the quadratic/linear approximating subproblem. However neither matrix
needs to be positive definite and the rank one formula is potentially unstable, so
again this method would not be recommended. The observation that a variety of
methods can be classified as generalized elimination methods is due to Fletcher
(1972a) and the implementation via $S$ and $Z$ matrices to Gill and Murray (1974a).
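The equivalence between updating the reduced matrix with reduced differences and updating the $n \times n$ matrix $Z H Z^T$ with unreduced differences can be checked numerically for the BFGS member of the Broyden family. The sketch below is an illustration rather than a proof; the data (`A`, `G`, `y`) are hypothetical and $Z$ is taken from the orthogonal factorization.

```python
import numpy as np

def bfgs_update(H, delta, gamma):
    """BFGS update of an inverse Hessian approximation H."""
    dg = delta @ gamma
    Hg = H @ gamma
    return (H + (1.0 + gamma @ Hg / dg) * np.outer(delta, delta) / dg
            - (np.outer(delta, Hg) + np.outer(Hg, delta)) / dg)

A = np.array([[1.0], [2.0], [2.0]])
Z = np.linalg.qr(A, mode="complete")[0][:, 1:]   # Z^T A = 0, Z^T Z = I
P = Z @ Z.T                                      # projection matrix Z Z^T

G = np.diag([2.0, 1.0, 3.0])       # hypothetical Hessian generating the differences
y = np.array([0.3, -0.7])          # step in the reduced variables
delta_full = Z @ y                 # unreduced step
gamma_full = G @ delta_full        # unreduced gradient difference

H = np.eye(2)                      # reduced matrix, H^(1) = I
H_bar = Z @ H @ Z.T                # corresponding n x n matrix, here equal to P

H_new = bfgs_update(H, y, Z.T @ gamma_full)              # reduced differences
H_bar_new = bfgs_update(H_bar, delta_full, gamma_full)   # unreduced differences
```

Both routes give the same $n \times n$ matrix, and its product with $A$ vanishes, so in exact arithmetic the search direction stays in the null space.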

Finally it is of interest to consider the computation of the Lagrange multiplier
vector $\lambda^*$ at the solution to (11.1.2). This information is required when the equality
problem is a subproblem in the active set method, but can also be useful in a
sensitivity analysis as described in Section 9.1. Essentially $\lambda^*$ is defined by
$g^* = A\lambda^*$ and is readily computed by (10.1.14) or (10.1.19). There is the additional
complication however that $x^*$ and hence $g^*$ cannot be computed exactly due to the
non-finite nature of methods for solving (11.1.2). Thus the effect of calculating
approximate multipliers $\lambda^{(k)}$ at a point $x^{(k)}$ which approximates $x^*$ is considered. In
particular in the active set method $x^{(k)}$ might be quite a poor approximation to $x^*$
so the errors involved may not be negligible. The obvious possibility from (10.1.14)
is to compute $\lambda^{(k)}$ from

$$ \lambda^{(k)} = S^T g^{(k)} \qquad (11.1.18) $$

which can be regarded as a first order estimate of $\lambda^*$ since
$g^{(k)} = g^* + O(\|x^{(k)} - x^*\|)$. Unlike (10.1.14) the resulting $\lambda^{(k)}$ depends on how $S$ is
chosen because the equations $g^{(k)} = A\lambda^{(k)}$ are no longer consistent. The orthogonal
factorization method in which $S^T = A^+$ has a nice interpretation in that it gives the
best least squares solution of these equations. Another left inverse for $A$ with
special properties is the matrix $T^T$ defined in (10.2.3). Thus another possibility is to
compute $\lambda^{(k)}$ from

$$ \lambda^{(k)} = T^T g^{(k)}. \qquad (11.1.19) $$

$T$ includes curvature information about the problem and it can be shown that the
resulting $\lambda^{(k)}$ is a second order estimate of $\lambda^*$ (see (10.2.9) and Question 11.3).
Given that $T$ is available and that $x^{(k)}$ is sufficiently close to $x^*$, then (11.1.19)
gives a more accurate estimate of $\lambda^*$. However for reasons of convenience or
numerical stability it may be preferable to use (11.1.18).

For no-derivative problems, if differencing is carried out as in (11.1.15) or
(11.1.16) to estimate reduced derivatives, then no information is available to
compute Lagrange multipliers. In this case an additional $m$ function evaluations are
required to estimate the Lagrange multipliers $\lambda_i^{(k)}$ by using

$$ \lambda_i^{(k)} = \left( f(x^{(k)} + s_i h) - f^{(k)} \right) / h \qquad (11.1.20) $$

where $s_i$ is the corresponding column of the matrix $S$.
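The difference between the first and second order multiplier estimates can be made concrete on a quadratic objective, for which the estimate built from $T^T$ is exact at any feasible point. The sketch assumes a positive definite $G$, so that $T^T = (A^T G^{-1} A)^{-1} A^T G^{-1}$ as in (10.2.6), and takes $S^T = A^+$; all data and variable names are hypothetical.

```python
import numpy as np

G = np.diag([1.0, 10.0, 100.0])           # hypothetical f(x) = 0.5 x^T G x
A = np.array([[1.0], [1.0], [1.0]])       # single constraint x1 + x2 + x3 = 1
b = np.array([1.0])
f = lambda x: 0.5 * x @ G @ x
grad = lambda x: G @ x

Ginv_A = np.linalg.solve(G, A)
lam_star = np.linalg.solve(A.T @ Ginv_A, b)      # exact multiplier for this QP

x_k = np.full(3, 1.0 / 3.0)               # feasible but far from optimal
g_k = grad(x_k)

S_T = np.linalg.pinv(A)                           # S^T = A^+
T_T = np.linalg.solve(A.T @ Ginv_A, Ginv_A.T)     # T^T = (A^T G^-1 A)^-1 A^T G^-1

lam_first = S_T @ g_k                     # first order estimate (11.1.18)
lam_second = T_T @ g_k                    # second order estimate (11.1.19)

# No-derivative estimate (11.1.20): one extra function value per column of S.
h = 1e-6
S = S_T.T
lam_fd = np.array([(f(x_k + h * S[:, i]) - f(x_k)) / h for i in range(S.shape[1])])
```

Here the first order estimate is about 12.3 while the true multiplier is about 0.9, which illustrates how poor (11.1.18) can be when $x^{(k)}$ is remote from $x^*$; the difference formula reproduces the first order estimate, not the second order one.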


11.2 Inequality Constraints 


Most problems of interest involve inequality constraints and so can be expressed as
in (11.1.1). This section describes how the methods of the previous section can be
generalized to handle this problem by means of an active set method similar to that
described in Section 10.3. Some alternative possibilities are also discussed at the
end of the section. In the active set method certain constraints indexed by the
active set $\mathcal{A}$ are regarded as equalities and the rest are temporarily disregarded as
described in Section 10.3. Each iterate $x^{(k)}$ is a feasible point and each iteration
attempts to locate the solution of the EP

$$
\begin{aligned}
&\underset{\delta}{\text{minimize}} && f(x^{(k)} + \delta) \\
&\text{subject to} && a_i^T \delta = 0, \quad i \in \mathcal{A}
\end{aligned}
\qquad (11.2.1)
$$

obtained by shifting the origin to the point $x^{(k)}$. The initial point $x^{(1)}$ can be
found by the methods of Sections 8.4 or 10.3. The solution of (11.2.1) yields a
search direction $s^{(k)}$ according to the method used, and $x^{(k+1)}$ is taken (ideally) as
the best feasible point on the line $x^{(k)} + \alpha s^{(k)}$. In quadratic programming there is
a clear distinction either that $x^{(k+1)}$ solves the EP or that a previously inactive
constraint becomes active in the line search. This is not so in the more general case
(11.1.1) and the minimizer of the EP is only located in the limit of a sequence of
iterations with the same $\mathcal{A}$. Thus on each iteration a decision must be taken as to
whether $x^{(k)}$ (that is $\delta = 0$) is an acceptable solution of the EP; if not, then one or
more iterations are carried out with the same active set. The definition of an
acceptable solution must be made with some care and is considered in more detail in
Section 11.3. If $x^{(k)}$ is accepted as the solution to the EP, then Lagrange multipliers
$\lambda^{(k)}$ are examined to determine whether or not $x^{(k)}$ is a KT point. If so then the
iteration terminates; otherwise the index $q$ of the inequality constraint for which
$\lambda_q^{(k)} < 0$ is least is removed from $\mathcal{A}$ as in (10.3.3).

The line search for finding the best feasible point in this more general case is also
not a finite process and is therefore more complicated than (10.3.2). As in Sections
2.4 and 2.6 of Volume 1 it is necessary to choose conditions which define a range
of acceptable $\alpha$-values, and to locate such a value by a combination of interpolation
and sectioning in such a way that the iteration is guaranteed to terminate. Exactly
the same holds good here, but there is the additional complication that $\alpha$ must not
exceed the value $\bar\alpha^{(k)}$, say, defined as in (8.3.6) by

$$ \bar\alpha^{(k)} = \min_{\substack{i:\ i \notin \mathcal{A}^{(k)} \\ a_i^T s^{(k)} < 0}} \frac{b_i - a_i^T x^{(k)}}{a_i^T s^{(k)}}. $$

Thus the initial choice for $\alpha$ in the line search, or any $\alpha$-values determined by
extrapolation, must be reduced to $\bar\alpha^{(k)}$ if they would otherwise exceed this value.
It may be that there are no acceptable $\alpha$-values in the sense of Section 2.6 for
which $\alpha \le \bar\alpha^{(k)}$. In this case $\alpha^{(k)} = \bar\alpha^{(k)}$ is chosen. It is in this way that the value
$\alpha^{(k)}$ in the line search in step (e) of (11.2.2) below is determined. Thus the active
set method for solving (11.1.1) can be summarized as follows.


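(a) Given $x^{(1)}$ and $\mathcal{A} = \mathcal{A}(x^{(1)})$, set $k = 1$.
(b) If $\delta = 0$ is not an acceptable solution of (11.2.1), go to (d).
(c) Let $\lambda_q^{(k)}$ solve $\min_i \lambda_i^{(k)}$, $i \in \mathcal{A} \cap I$; if $\lambda_q^{(k)} \ge 0$, terminate with
    $x^* = x^{(k)}$, otherwise remove $q$ from $\mathcal{A}$.
(d) Solve (11.2.1) for $s^{(k)}$.                                        (11.2.2)
(e) Choose $x^{(k+1)} = x^{(k)} + \alpha^{(k)} s^{(k)}$ as a near best feasible point along
    the line $x^{(k)} + \alpha s^{(k)}$.
(f) If $\alpha^{(k)} = \bar\alpha^{(k)}$, add $p$ to $\mathcal{A}$.
(g) Set $k = k + 1$ and go to (b).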


Note that after an index $q$ is removed from $\mathcal{A}$ in step (c) it is assumed that the
search direction $s^{(k)}$ which is calculated in step (d) is downhill ($s^{(k)T} g^{(k)} < 0$) and
strictly feasible with respect to the deleted constraint ($s^{(k)T} a_q > 0$). This can be
done in most common methods (see Question 11.7). The choice of method for
solving the EP (11.2.1) in step (d) is otherwise any of the methods considered in
Section 11.1, and depends on what derivative information is available, amongst other
things. Possible difficulties caused by degeneracy are no less severe than those
described in Section 10.3 and elsewhere and it is usual, albeit somewhat
unsatisfactory, to assume that it does not occur. In this case convergence of the
algorithm depends on how the choice of an acceptable solution in step (b) is made,
and this is considered in more detail in Section 11.3.
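The steps of the active set method can be sketched as follows for a problem with inequality constraints only. This is a deliberately simplified illustration, not the algorithm as it would be implemented in practice: step (d) uses a Newton direction in the null space, the line search of step (e) is replaced by a unit step capped at the ratio-test value $\bar\alpha^{(k)}$, the acceptance test of step (b) is a plain tolerance on the reduced gradient, multipliers come from the least squares form of (11.1.18), and degeneracy is ignored. The function names and the test problem are hypothetical.

```python
import numpy as np

def null_space(Aa, n):
    """Orthonormal Z with Z^T Aa = 0 (Z = I when no constraint is active)."""
    m = Aa.shape[1]
    if m == 0:
        return np.eye(n)
    return np.linalg.qr(Aa, mode="complete")[0][:, m:]

def active_set_newton(grad, hess, A, b, x, tol=1e-8, max_iter=100):
    """Minimize f subject to A^T x >= b, starting from a feasible x."""
    n, m = A.shape
    active = [i for i in range(m) if abs(A[:, i] @ x - b[i]) <= tol]   # step (a)
    for _ in range(max_iter):
        Aa = A[:, np.array(active, dtype=int)]
        Z = null_space(Aa, n)
        g = grad(x)
        rg = Z.T @ g
        if np.linalg.norm(rg) <= tol:                      # step (b): EP solved
            if not active:
                return x, active
            lam = np.linalg.lstsq(Aa, g, rcond=None)[0]    # multipliers, S^T = A^+
            q = int(np.argmin(lam))
            if lam[q] >= -tol:                             # step (c): KT point
                return x, active
            del active[q]                                  # remove q from the set
            continue
        s = Z @ np.linalg.solve(Z.T @ hess(x) @ Z, -rg)    # step (d)
        alpha, p = 1.0, None                               # step (e): ratio test
        for i in range(m):
            slope = A[:, i] @ s
            if i not in active and slope < 0.0:
                ratio = (b[i] - A[:, i] @ x) / slope
                if ratio < alpha:
                    alpha, p = ratio, i
        x = x + alpha * s
        if p is not None:                                  # step (f)
            active.append(p)
    return x, active

# Hypothetical problem: minimize (x1 - 2)^2 + (x2 + 1)^2
# subject to x1 >= 0, x2 >= 0, -x1 - x2 >= -1.
target = np.array([2.0, -1.0])
grad = lambda x: 2.0 * (x - target)
hess = lambda x: 2.0 * np.eye(2)
A = np.array([[1.0, 0.0, -1.0], [0.0, 1.0, -1.0]])
b = np.array([0.0, 0.0, -1.0])
x_sol, active = active_set_newton(grad, hess, A, b, np.zeros(2))
```

Starting at the origin with both bounds active, the sketch releases $x_1 \ge 0$, moves along $x_2 = 0$ until the third constraint becomes active, and stops at the vertex $(1, 0)$ where both remaining multipliers are positive.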

An important feature of an active set method concerns the efficient solution of
the EP (11.2.1) when changes are made to $\mathcal{A}$. It is possible to avoid recalculating
$S$ and $Z$ in $O(n^3)$ operations and instead to update these matrices in $O(n^2)$
operations, taking advantage of the fact that a column is either added to or removed
from $A$. Updating of other matrices may also be possible. Again details of these
computations are not given but the references towards the end of Section 10.3 are
again relevant. It is interesting however to review the changes to the reduced
Hessian matrix $Z^T G^{(k)} Z$ when a change is made to the active set $\mathcal{A}$, and hence to
$Z$. For the basic Newton method $G^{(k)}$ also changes in a general way so no
advantage can be taken of updating techniques. However the situation is more
favourable when using a quasi-Newton method. First of all consider the case when
$\alpha^{(k)} = \bar\alpha^{(k)}$ in step (e) of (11.2.2) so that a constraint becomes active. It is possible
that the curvature estimate $\gamma^T \delta \le 0$ may occur (see Section 3.2) in which case the
matrix $H^{(k)}$ must not be updated. The new $Z$ matrix has one fewer column than the
old matrix; by making a linear transformation of the columns of $Z$ it is possible to
arrange matters so that column $z_{n-m}$ is removed from $Z$. In this case the new
matrix $B^{(k)}$ in (11.1.11) is obtained by removing row and column $n - m$ from the
old matrix $B^{(k)}$ (after transformation). This is the same operation that is carried
out in quadratic programming and details are given for example by Gill and
Murray (1978b). A corresponding formula for updating $H^{(k)}$ can also be obtained
(Question 11.4); in terms of the matrix $\bar{H}^{(k)}$ used in Goldfarb's method (see
(11.1.17)) the recurrence relation is

$$ \bar{H} := \bar{H} - \frac{\bar{H} a a^T \bar{H}}{a^T \bar{H} a} \qquad (11.2.3) $$

suppressing superscript $k$, where $a$ is the column that is added to $A$. When a
constraint index is removed from $\mathcal{A}$ as in step (c) of (11.2.2), an extra column
$z_{n-m+1}$ is adjoined to $Z$. This extends the space of the free variables to one higher
dimension, and no curvature information is available in the new direction. Hence
$B^{(k)}$ must be extended in an arbitrary way: the most convenient way is to assign

$$ B^{(k)} := \begin{bmatrix} B^{(k)} & 0 \\ 0^T & 1 \end{bmatrix} \qquad (11.2.4) $$

which is in the spirit of an initial choice of $B^{(1)} = I$. $H^{(k)}$ is updated in an
analogous way and corresponds to an update

$$ \bar{H}^{(k)} := \bar{H}^{(k)} + z_{n-m+1} z_{n-m+1}^T \qquad (11.2.5) $$

in $\bar{H}^{(k)}$.
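The updates that accompany a change of active set can be illustrated on projection matrices, for which the result is known independently. The sketch assumes that the recurrence for adding a constraint has the familiar projection-type form $\bar{H} := \bar{H} - \bar{H} a a^T \bar{H} / (a^T \bar{H} a)$, and that the adjoined column $z_{n-m+1}$ is a unit vector orthogonal to the remaining constraint normals and to the existing columns of $Z$. The function names and data are hypothetical.

```python
import numpy as np

def add_constraint(H_bar, a):
    """Downdate the n x n matrix when column a is added to A (assumed form)."""
    Ha = H_bar @ a
    return H_bar - np.outer(Ha, Ha) / (a @ Ha)

def drop_constraint(H_bar, z_new):
    """Extend the n x n matrix when a constraint is removed, as in (11.2.5)."""
    return H_bar + np.outer(z_new, z_new)

def extend_B(B):
    """Border the reduced Hessian approximation with a unit diagonal (11.2.4)."""
    k = B.shape[0]
    return np.block([[B, np.zeros((k, 1))], [np.zeros((1, k)), np.ones((1, 1))]])

A = np.array([[1.0], [0.0], [0.0]])
P = np.eye(3) - A @ np.linalg.pinv(A)            # H_bar = Z Z^T for the initial choice H = I

a = np.array([0.0, 1.0, 1.0])                    # normal of the newly active constraint
P_added = add_constraint(P, a)

A2 = np.column_stack([A, a])
P_expected = np.eye(3) - A2 @ np.linalg.pinv(A2) # projector for the enlarged A

z_new = a / np.linalg.norm(a)                    # direction freed when a is dropped again
P_restored = drop_constraint(P_added, z_new)
B_ext = extend_B(np.array([[2.0]]))
```

In this special case the downdate reproduces the projector for the enlarged constraint matrix, and adjoining the freed direction restores the original projector.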

It is interesting to consider quadratic termination results for the active
set/quasi-Newton methods when the above updating formulae are used. If the active
set is changed $c$ times in all, and $m_i$, $i = 0, 1, \ldots, c$, is the number of active
constraints after the ith change, then an obvious bound for the total number $t$ of
exact line searches is

$$ t \le \sum_{i=0}^{c} (n - m_i). $$
However Powell (1972) shows that this is unduly pessimistic and that a better
bound is

$$ t \le c + n - m_0 $$


which shows that introducing arbitrary information when using (11.2.4) does not
increase the bound by more than one. This result also allows for additional inexact
line searches to be mixed with exact line searches. It seems that the tighter bound

$$ t \le c + n - \max_{0 \le i \le c} m_i $$

also follows from Powell's paper, which can be important for large almost linear
problems. These bounds all beg the question as to whether the algorithm does
terminate, that is whether there exists a finite $c$. This is by no means obvious since
the algorithm can return to a previous active set, so the finiteness of all possible
active sets does not immediately imply termination.

Another observation relating to the active set method when using a
Levenberg-Marquardt parameter is also of interest. It is described in Volume 1,
Section 5.2, how it is possible to control the iteration in two ways, either using the
parameter $h^{(k)}$ (algorithm (5.1.6)) or the parameter $\nu^{(k)}$ (algorithm (5.2.7)). In an
active set method it is far preferable to control the algorithm using $h^{(k)}$ (Holt and
Fletcher, 1979), since a suitable value of $h^{(k)}$ is likely to be little affected by changes
in active set, whereas this is not true for the $\nu^{(k)}$ parameter. An interesting example
of the potential difficulties is described in Question 11.5.

Finally a method for solving (11.1.1) is described which avoids using the active
set method. The method is a version of Newton's method in which at an iterate
$x^{(k)}$ a quadratic Taylor series approximation

$$ f(x^{(k)} + \delta) \approx q^{(k)}(\delta) = f^{(k)} + \delta^T g^{(k)} + \tfrac{1}{2} \delta^T G^{(k)} \delta \qquad (11.2.6) $$

is made, and the subproblem

$$
\begin{aligned}
&\underset{\delta}{\text{minimize}} && q^{(k)}(\delta) \\
&\text{subject to} && a_i^T \delta = b_i - a_i^T x^{(k)}, \quad i \in E \\
& && a_i^T \delta \ge b_i - a_i^T x^{(k)}, \quad i \in I
\end{aligned}
\qquad (11.2.7)
$$

derived directly from (11.1.1) is solved. To ensure global convergence a step
restriction

$$ \|\delta\|_\infty \le h^{(k)} \qquad (11.2.8) $$

is used in an algorithm like (5.1.6). The method therefore solves a sequence of
quadratic programming subproblems with inequality constraints. The method is a
special case of the SOLVER method described in Section 12.3. A disadvantage is
that each iteration is relatively expensive and advantage cannot usually be taken of
the updating techniques of quasi-Newton methods to reduce the overall operation
count to $O(n^2)$ operations per iteration. However the solution of this subproblem
does enable the correct active set to be determined quickly and also avoids the
problem of zigzagging (see the remarks at the end of Section 11.3) so may converge
more rapidly when remote from the solution. Certainly if the function and
derivatives are expensive to compute, this approach can be preferable. A quasi-Newton
version of the algorithm is given by Fletcher (1972b), although it might be better
to use as an updating formula the version of the BFGS formula due to Powell
(1978a) based on (12.3.18).


11.3 Zigzagging 


A feature which can adversely affect the rate of convergence of any type of method
for handling inequality constraints is known as zigzagging. Although the set of
active constraints $\mathcal{A}^*$ at a local solution is well defined, it may be that the sets
$\mathcal{A}^{(k)}$ (as defined in (7.1.2)) obtained at the iterates $x^{(k)}$ in some method do not
settle down (so that $\mathcal{A}^{(k)} = \mathcal{A}^*$ for all $k \ge K$ where $K$ is sufficiently large) but
oscillate between different subsets of the constraints in the problem. For linear
constraints, this corresponds to zigzagging between different linear manifolds
corresponding to the feasible region in (11.2.1) for different $\mathcal{A}$ (see Figure 11.3.1). If
the active set does settle down (with $\mathcal{A}^{(k)} = \mathcal{A}^*\ \forall\, k \ge K$) then the rate of
convergence becomes that for an equality constraint method (Section 11.1) which is
usually superlinear for most good methods. However if zigzagging occurs then the
rate can degenerate to being linear with a rate constant which is not small, and in
some cases the method can fail to converge to a solution.

For the active set method of algorithm (11.2.2), the likelihood of zigzagging is
correlated with the test for an acceptable solution implied in step (b). If it is
possible to solve any EP exactly in a finite number of steps, and if step (b) tests for
an exact solution to the EP (as in quadratic programming, (10.3.4), step (b)), then
it is not possible for zigzagging to occur (excluding degenerate cases). This is
because the algorithm terminates for the reasons described in Section 10.3; once a
constraint index is removed from the active set $\mathcal{A}^{(k)}$ in step (c), then the algorithm
can no longer return to that active set by virtue of the optimality of $x^{(k)}$ and the
fact that $f^{(k)}$ is monotonically decreasing. However it is not possible to solve any
EP exactly when $f(x)$ is a general function, and it is inefficient to find the solution
to high accuracy because the solution to the EP may not correspond to the
solution of (11.1.1). The opposite possibility therefore is to allow step (c) to be taken
on every iteration. It is this strategy which is most likely to cause zigzagging,
especially when the multiplier estimates are poor. An example is given by Wolfe
(1972) in which zigzagging causes non-convergence to a solution (see Question
11.6). The example is somewhat pathological because the objective function is not
$\mathbb{C}^2$ and the steepest descent method is used to determine $s^{(k)}$ in (11.2.2), step (d).
Nonetheless it has been observed in practice with quasi-Newton methods and
smooth functions that this strategy can induce a slow rate of linear convergence.
Therefore what is required on step (b) is to have some compromise between these
two extremes which eliminates the possibility of zigzagging and yet which enables
an active set $\mathcal{A}^{(k)}$ to be changed when it has become likely that it is not the same
as $\mathcal{A}^*$.

Figure 11.3.1 Zigzagging

Various ad hoc rules to prevent zigzagging are reviewed by Fletcher (1972a)
and can work well enough in practice. However it seems preferable in view of the
above discussion to try to eliminate zigzagging by looking at the accuracy to which
the current EP has been solved. This is done by Rosen (1960), Goldfarb (1969),
and Murtagh and Sargent (1969). They all suggest similar strategies which are
typified by, although different to, the following (Fletcher, 1972a). At $x^{(k)}$ a
second order estimate of the reduction in $f(x)$ by keeping the same active set is
given by

$$ \Delta^{(k)} = \tfrac{1}{2} g^{(k)T} Z H^{(k)} Z^T g^{(k)} \qquad (11.3.1) $$

by virtue of (11.1.8) and (11.1.9). $H^{(k)}$ is positive definite and is either equal to or
an approximation to the inverse reduced Hessian matrix, as in (11.1.13). If $\lambda^{(k)}$ is a
second order multiplier estimate, and $\lambda_i^{(k)} < 0$, then the additional function
reduction obtained by removing the ith constraint index from $\mathcal{A}$ (assuming $i \in I$) is

$$ \Delta_i^{(k)} = -\tfrac{1}{2} \lambda_i^{(k)2} / u_i^{(k)} \qquad (11.3.2) $$

where $u_i^{(k)}$ is the diagonal element of the matrix $U^{(k)}$ defined by (10.2.3) at $x^{(k)}$.
This result is valid when $u_i^{(k)} < 0$ which is implied by (10.2.6) when $G$ is positive
definite. Hence a possible strategy is to find the integer $p$ which maximizes
$\Delta_i^{(k)}$ over the set $\{i : \lambda_i^{(k)} < 0\}$ and then to define $x^{(k)}$ as an acceptable solution
(and hence remove the pth constraint) if $\Delta_p^{(k)} > \Delta^{(k)}$. The motivation for such a
test is that a greater reduction in $f(x)$ is likely to occur by removing the constraint
than by not doing so. For a quasi-Newton method a recurrence relation for the
quantities $u_i^{(k)}$ can be determined. An advantage of this test is that it is invariant
under both linear transformations of the variables and scaling of the constraints.
If $G^{(k)}$ (or its approximation) is not positive definite, something has to be done to
generalize the definitions of $\Delta^{(k)}$ and $\Delta_i^{(k)}$. One possibility is to define $\Delta^{(k)} = \infty$ if
$H^{(k)}$ is not positive definite and to have $x^{(k)}$ as not being an acceptable solution.
Otherwise, if $u_i^{(k)} \ge 0$, then $\Delta_i^{(k)} = \infty$ is defined and $p$ is chosen to minimize $\lambda_i^{(k)}$
over the set $\{i : \lambda_i^{(k)} < 0,\ \Delta_i^{(k)} = \infty\}$. Strategies based on comparing quantities like
(11.3.1) and (11.3.2) have been used successfully in practice.
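For a quadratic objective both predicted reductions are exact, which gives a convenient numerical check. The sketch below takes (11.3.2) in the form $\Delta_i = -\tfrac{1}{2}\lambda_i^2/u_i$ with $U = -(A^T G^{-1} A)^{-1}$ for positive definite $G$; this sign convention is an assumption, chosen so that the predicted reduction is positive when $u_i < 0$. The data and the helper `solve_ep` are hypothetical.

```python
import numpy as np

def solve_ep(G, c, A, b):
    """Solve min 0.5 x^T G x + c^T x s.t. A^T x = b via the Lagrangian system."""
    n, m = A.shape
    K = np.block([[G, -A], [-A.T, np.zeros((m, m))]])
    sol = np.linalg.solve(K, -np.concatenate([c, b]))
    return sol[:n], sol[n:]                        # minimizer and multipliers

G = np.array([[2.0, 0.5, 0.0], [0.5, 3.0, 0.2], [0.0, 0.2, 1.0]])
c = np.array([-1.0, 0.5, 2.0])
A = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
b = np.array([1.0, 1.0])
q = lambda x: 0.5 * x @ G @ x + c @ x

x_ep, lam = solve_ep(G, c, A, b)

# Predicted reduction from keeping the active set, evaluated at a feasible
# point x_k that is not the EP solution (11.3.1).
Z = np.linalg.qr(A, mode="complete")[0][:, 2:]
x_k = x_ep + Z @ np.array([0.3])
g_k = G @ x_k + c
H = np.linalg.inv(Z.T @ G @ Z)
delta_keep = 0.5 * g_k @ Z @ H @ Z.T @ g_k
actual_keep = q(x_k) - q(x_ep)

# Predicted extra reduction from releasing constraint i at the EP solution.
U = -np.linalg.inv(A.T @ np.linalg.solve(G, A))
i = int(np.argmin(lam))
delta_i = -0.5 * lam[i] ** 2 / U[i, i]
x_free, _ = solve_ep(G, c, np.delete(A, i, axis=1), np.delete(b, i))
actual_i = q(x_ep) - q(x_free)
```

Comparing `delta_i` with `delta_keep` is then the test described above: the constraint is released only when doing so promises the larger reduction.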

The dependence of the above test on the quantities $u_i^{(k)}$ makes it somewhat
unwieldy and it is preferable to have a more simple test for an acceptable solution
with good invariance properties. It is also important to consider whether it can be
proved that the resulting method converges to the solution of (11.1.1) and avoids
zigzagging. A test which meets these criteria to a large extent can be derived in the
following way. Let $\{\ell(k)\}$, $k = 1, 2, \ldots$, be a sequence of integers such that $\ell(k)$
($\le k$) is the greatest previous iteration index on which a constraint is removed from
$\mathcal{A}$, or $\ell(k) = 1$ if no such index exists. Then a point $x^{(k)}$ is acceptable in step (b) if

$$ \Delta^{(k)} \le f^{(\ell(k))} - f^{(k)}, \qquad (11.3.3) $$

that is if the predicted function reduction for the current $\mathcal{A}$ is less than the total
reduction since a constraint was last removed from $\mathcal{A}$. An alternative possibility is
to define the right hand side of (11.3.3) as the total function reduction since $\mathcal{A}$
was last changed. The motivation for this test is that when zigzagging occurs the
right hand side of (11.3.3) goes to zero, which ensures that $\Delta^{(k)}$ goes to zero,
which usually ensures that convergence occurs for the subsequence of points with
the same $\mathcal{A}$. If $x^{(k)}$ is accepted as a solution of the EP then the decision as to
which constraint to remove is that given in (11.2.2), step (c). If the constraints are
prescaled then these tests are invariant under linear transformations of variables and
constraint scalings. The test must also be incorporated with the termination test
which is conveniently chosen to be $\Delta^{(k)} \le \epsilon$ where $\epsilon > 0$ is some preset tolerance
on the accuracy required in $f^*$. The detailed form of steps (b) and (c) is therefore

(b) if $\Delta^{(k)} > f^{(\ell(k))} - f^{(k)}$ then go to (d);
(c) (i) calculate $\lambda^{(k)}$ from (11.1.18) or (11.1.19) and let
        $\lambda_q^{(k)}$ minimize $\lambda_i^{(k)}$, $i \in \mathcal{A} \cap I$;                              (11.3.4)
    (ii) if $\lambda_q^{(k)} \ge 0$ then (if $\Delta^{(k)} \le \epsilon$ then terminate else go
        to (d)) else remove $q$ from $\mathcal{A}$.

In some cases it is possible to prove that convergence without zigzagging takes
place when (11.3.4) is used in (11.2.2). Such a result is given below. To some extent
this depends on the properties of the method used for finding the solution and
multipliers of the EP. To make the result easy to state, it is assumed that the
feasible region $R$ is not empty, that $f(x)$ is $\mathbb{C}^2$ on $R$, and that the smallest
eigenvalue $\mu_n$ of the Hessian matrix $G(x)$ satisfies $\mu_n(x) \ge a > 0$ on $R$, an assumption
which implies strict convexity of $f$ on $R$. Newton's method with line search is used
and the search can be terminated by any of the conditions described in Section 2.4,
so that convergence results like theorem 2.4.1 can be applied. It is also assumed for
any $x \in R$ that the vectors $a_i$, $i \in \mathcal{A}(x)$, are independent. In this case either
(11.1.18) or (11.1.19) can be used to calculate $\lambda^{(k)}$; all that is required is that if
$x^{(k)} \to x^*$ for some fixed $\mathcal{A}$ then $\lambda^{(k)} \to \lambda^*$.


Theorem 11.3.1 


Under the above assumptions the above method converges to the solution of
(11.1.1) from any initial feasible point $x^{(1)}$. In addition if strict complementarity
holds at $x^*$ then no zigzagging can occur and $\mathcal{A}^{(k)} = \mathcal{A}^*$ for all $k$ sufficiently large.


Proof 


If $\mathcal{A}^{(k)}$ is constant for all $k$ sufficiently large then $x^{(k)} \to x^*$ by theorem 2.4.1
applied to the reduced problem. Otherwise the integers $l(k) \to \infty$; also the
assumptions imply that $f(x)$ is bounded below so $f^{(k)} \downarrow f^\infty$ and
$f^{(l(k))} - f^{(k)} \to 0$, which implies that $\Delta^{(k)} \to 0$ on the subsequence of
iterations for which (11.3.3) holds. Let $x' \ne x''$ be two accumulation points
($f' = f'' = f^\infty$) such that $x^{(k)} \to x'$ and $x^{(k+1)} \to x''$ for a subsequence. Then
$\mu_n^{(k)} \ge a$ and $s^{(k)T} g^{(k)} \le 0$ imply that $x^{(k+1)}$ is not acceptable with respect to
test (2.4.2) in the line search for some sufficiently large $k$, which is a contradiction.
Thus the main sequence converges, say $x^{(k)} \to x^\infty$. Let $K$ be sufficiently large such
that any set $\mathcal{A}^{(k)}$, $k \ge K$, occurs infinitely often and restrict attention to $k \ge K$.
For any such set $\mathcal{A}^{(k)}$, using positive definiteness and independence assumptions, it
follows from $\Delta^{(k)} \to 0$ that $Z^T g^{(k)} \to 0$ on a subsequence, and hence that $x^\infty$
minimizes $f(x)$ on $a_i^T x = b_i$, $i \in \mathcal{A}^{(k)}$, and that
$g^\infty = \sum_{i \in \mathcal{A}^{(k)}} a_i \lambda_i^\infty$. Also $\lambda_i^{(k)} \to \lambda_i^\infty$, $i \in \mathcal{A}^{(k)}$. Therefore from
the independence assumptions, $g^\infty = \sum_{i \in \cap_k \mathcal{A}^{(k)}} a_i \lambda_i^\infty$ and so $\lambda_i^\infty = 0$ for all
$i \in \cup_k \mathcal{A}^{(k)} - \cap_k \mathcal{A}^{(k)}$. Also by continuity and since $\lambda_q^{(k)}$ is chosen as in (11.3.4),
step (c)(i), it follows that $\lambda_i^\infty \ge 0$, $i \in I \cap \mathcal{A}^{(k)}$. Hence $x^\infty$, $\lambda^\infty$ is a KT point and by
convexity solves (11.1.1). If strict complementarity $\lambda_i^* > 0$, $i \in \mathcal{A}^* \cap I$, holds then
the set $\cup_k \mathcal{A}^{(k)} - \cap_k \mathcal{A}^{(k)}$ is empty and hence $\mathcal{A}^{(k)}$ is constant for sufficiently
large $k$. Also $\mathcal{A}^{(k)} = \mathcal{A}^*$ follows because otherwise $\lambda_i^* = 0$, $i \in \mathcal{A}^* - \mathcal{A}^{(k)}$, is
implied which is a contradiction. □


Exactly what can be proved about zigzagging of the active set method under 
weaker assumptions is a question of some interest. 

It is also of interest to examine to what extent algorithms other than the active 
set method can exhibit zigzagging. In particular the algorithm defined by (11.2.7) 
and (11.2.8) can be shown to behave in a desirable way without requiring any anti- 
zigzagging precautions. The following is a brief sketch of this result. In a similar 
manner to theorems 5.1.1 and 5.1.2 it can be shown that the algorithm converges 
to the solution of (11.1.1) and that the restriction (11.2.8) becomes inactive. 

If $\lambda^{(k+1)}$ are the multipliers of the linear constraints in (11.2.7) then by continuity
it follows that $\lambda^{(k+1)} \to \lambda^*$; thus if $\lambda_i^* > 0$, $i \in \mathcal{A}^* \cap I$, it follows that $\mathcal{A}^{(k)} = \mathcal{A}^*$
for all $k$ sufficiently large.


Questions for Chapter 11 


1. Relate the conditions $Z^T g^* = 0$ and $Z^T G^* Z$ positive definite arising from
(11.1.6) and (11.1.7) to the first and second order conditions of Sections 9.1
and 9.3 applied to problem (11.1.2) when rank($A$) $= m$. Show that $Z^T g^* = 0$ is
equivalent to $g^* = A\lambda^*$. Also show that the columns of $Z$ are a basis for the
linear space (9.3.4) and hence that (9.3.5) is equivalent to the condition that
$Z^T G^* Z$ is positive definite.

2. Consider any Broyden method with parameters CORT). applied to 
solve the reduced problem (11.1.4) using the positive definite inverse reduced 
Hessian matrix H“) in (11.1.13). Show for the original problem (11.1.2) that 
an equivalent sequence {x} is obtained if the same ¢) are used in a Broyden 
family update of the positive semi-definite matrix H“) defined in (11.1.17), 
when s“) is calculated from s“*) = —H)g(). Assume that the initial matrices 
are related by H‘!) = ZH()z!. 


3. If $h^{(k)} = x^{(k)} - x^*$ where $x^*$ solves (11.1.2) show that the vector $\lambda^{(k)}$ in
(11.1.18) satisfies $\lambda^{(k)} = \lambda^* + O(\|h^{(k)}\|)$ and is therefore a first order estimate
of $\lambda^*$. If $\lambda^{(k)}$ is computed from (11.1.19) show that $\lambda^{(k)} = \lambda^* + O(\|h^{(k)}\|^2)$.
Use a result like (10.2.9) and assume that rank($A$) $= m$, that $Z^T G^* Z$ is positive
definite, and that the expansion $g^{(k)} = g^* + G^* h^{(k)} + O(\|h^{(k)}\|^2)$ is valid.


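4. Assume that $n \times n$ matrices are related by

$$\tilde{B} = \begin{bmatrix} B & b \\ b^T & \beta \end{bmatrix}$$

and let $H = B^{-1}$ and $\tilde{H} = \tilde{B}^{-1}$. Show that $H$ and $\tilde{H}$ are related by

$$\tilde{H} - \gamma u u^T = \begin{bmatrix} H & 0 \\ 0^T & 0 \end{bmatrix}$$

where either $u^T = (b^T H : -1)$ and $\gamma = 1/(\beta - b^T H b)$ or $u = \tilde{H} e_n$ and $\gamma = u_n^{-1}$,
depending whether it is $H$ or $\tilde{H}$ that is known. Hence deduce (11.2.3).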


5. For fixed $x'$, show that adding a linear equality constraint to problem (11.1.2)
cannot increase the condition number of the reduced Hessian matrix at $x'$,
assuming that $x'$ is feasible in both problems. It might be thought in general
therefore that adding linear constraints cannot degrade the conditioning of an
optimization problem. This is not so as the following example shows. Consider
best least squares data fitting with a sum of exponentials $a e^{\alpha t} + b e^{\beta t}$. The
unknown parameters $a$, $\alpha$, $b$, $\beta$ must be chosen to best fit some given data. The
problem usually has a well-defined (albeit ill-conditioned) solution. Consider
adding the linear constraint $\alpha = \beta$. Then the parameters $a$ and $b$ become
underdetermined and the reduced Hessian matrix is singular at any feasible point.
Explain this apparent contradiction.


6. Consider the linear constraint problem

$$\min_x \; \tfrac{4}{3}\left(x_1^2 - x_1 x_2 + x_2^2\right)^{3/4} - x_3 \quad \text{subject to } x_3 \le 2, \; x \ge 0$$

due to Wolfe (1972). Show that the objective function is convex but not $\mathbb{C}^2$ and
that the solution is $x^* = (0, 0, 2)^T$. Solve the problem from $x^{(1)} = (0, a, 0)^T$
where $0 < a \le \sqrt{2}/4$ using algorithm (11.2.2). Allow any point to be an acceptable
solution in step (b) and use the steepest descent method in step (d). Show
that $x^{(2)} = \tfrac12 (a, 0, \sqrt{a})^T$ and hence that, for $k \ge 2$,

$$x^{(k)} = \begin{cases} (0, \alpha, \beta)^T & \text{if } k \text{ is odd} \\ (\alpha, 0, \beta)^T & \text{if } k \text{ is even} \end{cases}$$

where $\alpha = (\tfrac12)^{k-1} a$ and $\beta = \tfrac12 \sum_{j=0}^{k-2} (a/2^j)^{1/2}$. Hence show that
$x^{(k)} \to (0, 0, (1 + \tfrac12\sqrt{2})\sqrt{a})^T$ which is neither optimal nor a KT point.


7. After a constraint index $q$ is removed from $\mathcal{A}$ in step (c) of the active set
method, consider proving that the subsequent search direction $s^{(k)}$ is downhill
($s^{(k)T} g^{(k)} < 0$) and strictly feasible ($s^{(k)T} a_q > 0$). The $Z$ matrix is augmented
by a column vector $z$ as $Z := [Z : z]$. Show that the required conditions are
obtained if $Z^T G^{(k)} Z$ is positive definite for the new $Z$ matrix. If $Z^T G^{(k)} Z$ is
positive definite for the old Z matrix only, show that the choice s(*) = Te, 
described in Section 10.4 has the required properties. In both cases assume that 
a second order multiplier estimate (11.1.19) is used. 

Consider proving the same result for a quasi-Newton method when update 
(11.2.3) is used, when a first order estimate (11.1.18) for $\lambda^{(k)}$ is used, and when
the orthogonal factorization method defines $S$ and $Z$. Show that the column $z$
which augments $Z$ is $z = a_q^+$, which is the column of $A^{+T}$ ($= S$) corresponding to
constraint $q$. Hence show that $s^{(k)T} g^{(k)} < 0$ and $s^{(k)T} a_q = 1$, which satisfies the
required conditions.


Chapter 12 


Nonlinear Programming 


12.1 Penalty and Barrier Functions 


Nonlinear programming is the general case of (7.1.1) in which both the objective 
and constraint functions may be non-linear, and is the most difficult of the smooth 
optimization problems. Indeed there is no general agreement on the best approach 
and much research is still to be done. Historically the earliest developments were 
the sequential minimization methods described in Sections 12.1 and 12.2. These 
methods suffer from some computational disadvantages and are not entirely 
efficient. Nonetheless, especially for no-derivative problems in the absence of 
alternative software, the methods of Section 12.2 can still be recommended. Also 
the simplicity of the methods in this section (especially the shortcut method — see 
later) will continue to attract the unsophisticated user. Thus sequential techniques 
still have a useful part to play and are described in some detail. 

When first (and possibly second) derivatives are available, then the Lagrangian 
methods of Section 12.3 have proved to be considerably more efficient, and soft- 
ware is becoming available. To some extent robustness is introduced by using a line 
search in which an exact penalty function is minimized. However this approach can 
fail, and a theoretically viable approach to the problem of inducing global con- 
vergence whilst retaining a rapid rate of convergence is described in Sections 14.3 
and 14.5. This involves the use of a non-differentiable exact penalty function in 
conjunction with a restricted step method. Current research interest in these areas 
is strong, and further developments can be expected. No-derivative methods, pos- 
sibly using finite differencing to obtain derivative estimates, can be expected to 
follow once the best approach has been determined. However it is likely that the 
effort involved in evaluating expressions for first derivatives will pay off in terms 
of the efficiency and reliability of the resulting software. Another idea which has 
attracted a lot of attention is that of a feasible direction method described in 
Section 12.4. It is shown that this is essentially equivalent to a non-linear general- 
ized elimination of variables. There are inherent difficulties, however, in deter- 
mining a fully reliable method. Some other interesting but not currently favoured 
approaches to the solution of nonlinear programming problems are reviewed in 
Section 12.5. To simplify the presentation, methods are discussed either in terms
of the equality constraint problem

$$\min_x f(x) \quad \text{subject to } c(x) = 0 \tag{12.1.1}$$

where $c(x)$ is $\mathbb{R}^n \to \mathbb{R}^m$, or the inequality constraint problem

$$\min_x f(x) \quad \text{subject to } c_i(x) \ge 0, \quad i = 1, 2, \ldots, m. \tag{12.1.2}$$

Usually the generalization to solve the mixed problem (7.1.1) is straightforward.

When solving a general nonlinear programming problem in which the constraints 
cannot easily be eliminated, it is necessary to balance the aims of reducing the 
objective function and staying inside or close to the feasible region, in order to 
induce global convergence (that is convergence to a local solution from any initial 
approximation). This inevitably leads to the idea of a penalty function which is 
some combination of $f$ and $c$ which enables $f$ to be minimized whilst controlling
constraint violations (or near constraint violations) by penalizing them. Early 
penalty functions were smooth so as to enable efficient techniques for smooth 
unconstrained optimization to be used: currently non-differentiable penalty func- 
tions are also receiving attention. For the equality problem the earliest penalty 
function (Courant, 1943) is 


$$\phi(x, \sigma) = f(x) + \tfrac12 \sigma \sum_i (c_i(x))^2 = f(x) + \tfrac12 \sigma\, c(x)^T c(x). \tag{12.1.3}$$

Figure 12.1.1 Convergence of the Courant penalty function (minimizers $x(1)$, $x(10)$, $x(100)$ marked)

The penalty is formed from a sum of squares of constraint violations and the
parameter $\sigma$ determines the amount of the penalty. Some graphs of $\phi(x, \sigma)$ are given in
Figure 12.1.1 for the trivial problem: min $x$ subject to $c(x) \triangleq x - 1 = 0$ for which
$\phi = x + \tfrac12 \sigma (x - 1)^2$. If the solution $x^* = 1$ is compared with the points which
minimize $\phi(x, \sigma)$, it is clear that $x^*$ is a limit point of the latter as $\sigma \to \infty$. Thus the
technique of solving a sequence of minimization problems is suggested. This is
traditionally implemented as follows.


(i) Choose a fixed sequence $\{\sigma^{(k)}\} \to \infty$, typically $\{1, 10, 10^2, 10^3, \ldots\}$.
(ii) For each $\sigma^{(k)}$ find a local minimizer, $x(\sigma^{(k)})$ say, to
     $\min_x \phi(x, \sigma^{(k)})$.                                        (12.1.4)
(iii) Terminate when $c(x(\sigma^{(k)}))$ is sufficiently small.

The effect of this iteration on the problem

$$\text{minimize } -x_1 - x_2 \quad \text{subject to } 1 - x_1^2 - x_2^2 = 0 \tag{12.1.5}$$


for which the solution and optimum Lagrange multiplier is $x_1^* = x_2^* = \lambda^* = 1/\sqrt{2}$, is
shown in Table 12.1.1. It can be seen that $x(\sigma^{(k)}) \to x^*$ and that linear convergence
is obtained with one extra decimal place being obtained at each iteration. This
behaviour can in fact be justified for all problems, and this is done in theorem
12.1.2, equation (12.1.13), below. It must be emphasized that in practice step (ii)
in (12.1.4) is done numerically, that is by the application of an unconstrained
minimization method. The choice of this method will depend on whether or not
derivatives are available and on the size of the problem (see Volume 1). Often $x(\sigma^{(k)})$ is
used as an initial approximation when minimizing $\phi(x, \sigma^{(k+1)})$, and other
information such as inverse Hessian approximations can also be passed forward from one
iteration to the next. In fact algorithm (12.1.4) is idealized in that step (ii) cannot
be solved exactly in a finite number of operations. It is assumed that $x(\sigma^{(k)})$ is
obtained as accurately as possible: in fact bounds on the accuracy can be given
(equation (12.1.21) below) which still guarantee convergence. It has been assumed
that the local minimizer $x(\sigma)$ exists. This may not be so, not only when the
nonlinear programming problem is unbounded, but also in cases when local solutions
exist. In the latter case a remedy (not guaranteed to work) is to increase the initial
value $\sigma^{(1)}$ and repeat.
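The sequential iteration (12.1.4) applied to problem (12.1.5) can be sketched as follows. This is a minimal illustration and not the implementation used to produce Table 12.1.1: the inner solver (a damped Newton iteration with a backtracking line search), the starting point $(1, 1)$, the tolerances, and all function names are assumptions made for the example. Each minimization is started from the previous minimizer, as suggested in the text.

```python
import math

def constraint(x):
    # c(x) = 1 - x1^2 - x2^2 from problem (12.1.5)
    return 1.0 - x[0] ** 2 - x[1] ** 2

def penalty(x, sigma):
    # Courant penalty function (12.1.3) with f(x) = -x1 - x2
    c = constraint(x)
    return -x[0] - x[1] + 0.5 * sigma * c * c

def gradient(x, sigma):
    c = constraint(x)
    return [-1.0 - 2.0 * sigma * c * x[0], -1.0 - 2.0 * sigma * c * x[1]]

def hessian(x, sigma):
    # Entries (h11, h12, h22) of the symmetric 2 x 2 Hessian
    d = -2.0 * sigma * constraint(x)
    return (d + 4.0 * sigma * x[0] ** 2,
            4.0 * sigma * x[0] * x[1],
            d + 4.0 * sigma * x[1] ** 2)

def minimize_penalty(x, sigma, tol=1e-8, max_iter=100):
    """Damped Newton method (an assumed choice of inner solver)."""
    for _ in range(max_iter):
        g = gradient(x, sigma)
        if math.hypot(g[0], g[1]) <= tol:
            break
        h11, h12, h22 = hessian(x, sigma)
        det = h11 * h22 - h12 * h12
        if h11 > 0.0 and det > 0.0:   # Hessian positive definite: Newton step
            s = [-(h22 * g[0] - h12 * g[1]) / det,
                 -(h11 * g[1] - h12 * g[0]) / det]
        else:                          # otherwise fall back to steepest descent
            s = [-g[0], -g[1]]
        slope = g[0] * s[0] + g[1] * s[1]
        f0 = penalty(x, sigma)
        t = 1.0
        while t > 1e-12:               # backtracking (Armijo) line search
            trial = [x[0] + t * s[0], x[1] + t * s[1]]
            if penalty(trial, sigma) <= f0 + 1e-4 * t * slope:
                break
            t *= 0.5
        else:
            return x                   # no acceptable step: accept current point
        x = trial
    return x

def penalty_sequence(sigmas=(1.0, 10.0, 100.0, 1000.0, 10000.0)):
    x = [1.0, 1.0]                     # assumed starting point
    rows = []
    for sigma in sigmas:
        x = minimize_penalty(x, sigma)  # warm start from the previous minimizer
        c = constraint(x)
        rows.append((sigma, list(x), c, -sigma * c))  # last entry: -sigma * c
    return rows

if __name__ == "__main__":
    for sigma, x, c, lam in penalty_sequence():
        print(sigma, x, c, lam)
```

The last entry of each row, $-\sigma c$, approaches $\lambda^* = 1/\sqrt{2}$ as $\sigma$ increases, while the error in $x$ shrinks roughly in proportion to $1/\sigma$.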

A variety of results relating to the convergence of this sequential penalty
function can be given. In doing this, quantities derived from $\sigma^{(k)}$ like $x(\sigma^{(k)})$, $f(x(\sigma^{(k)}))$,
etc., are denoted by $x^{(k)}$, $f^{(k)}$, etc. It is assumed for the first theorem that $f(x)$ is
bounded below on the (non-empty) feasible region so that

$$f^* = \inf_{x:\, c(x) = 0} f(x) \tag{12.1.6}$$

exists. If global solutions can be computed in (12.1.4), step (ii), then the most
simple result is the following.
Theorem 12.1.1 (Penalty function convergence)
If $\{\sigma^{(k)}\} \uparrow \infty$ then

(i) $\{\phi(x^{(k)}, \sigma^{(k)})\}$ is non-decreasing,
(ii) $\{c^{(k)T} c^{(k)}\}$ is non-increasing,
(iii) $\{f^{(k)}\}$ is non-decreasing.

Also $c^{(k)} \to 0$ and any accumulation point $x^*$ of $\{x^{(k)}\}$ solves (12.1.1).


Proof
Let $\sigma^{(k)} < \sigma^{(l)}$: then from the definition of $x^{(k)}$ and (12.1.3),

$$\phi(x^{(k)}, \sigma^{(k)}) \le \phi(x^{(l)}, \sigma^{(k)}) \le \phi(x^{(l)}, \sigma^{(l)}) \le \phi(x^{(k)}, \sigma^{(l)}).$$

The first two inequalities give case (i). Subtracting the inner and outer inequalities
gives

$$(\sigma^{(l)} - \sigma^{(k)})(c^{(k)T} c^{(k)} - c^{(l)T} c^{(l)}) \ge 0$$

and hence case (ii). Substituting in the first inequality then gives case (iii). By
definition of $x^{(k)}$

$$\phi(x^{(k)}, \sigma^{(k)}) \le \inf_{x:\, c(x) = 0} \phi(x, \sigma^{(k)}) = f^* \tag{12.1.7}$$

by (12.1.6). Thus using case (iii) above in (12.1.3), $\sigma^{(k)} \uparrow \infty$ implies $c^{(k)} \to 0$. If
$x^{(k)} \to x^*$ it follows that $c(x^*) = 0$, so $f(x^*) \ge f^*$ as defined in (12.1.6). But from
(12.1.3) and (12.1.7), $f^{(k)} \le f^*$ so it follows that $f(x^*) = f^*$, that is $x^*$ solves
(12.1.1). □


It is interesting to observe that this result is obtained in the absence of differentiability
or Kuhn-Tucker regularity assumptions.
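As a small check of theorem 12.1.1, the trivial problem of Figure 12.1.1 (min $x$ subject to $x - 1 = 0$, so that $f^* = 1$) can be worked through by hand. The following is a worked illustration only; it uses nothing beyond (12.1.3) and elementary calculus.

$$
\begin{aligned}
\phi(x, \sigma) &= x + \tfrac12 \sigma (x - 1)^2, \qquad \phi'(x, \sigma) = 1 + \sigma (x - 1) = 0 \;\Rightarrow\; x(\sigma) = 1 - \frac{1}{\sigma}, \\
\phi(x(\sigma), \sigma) &= 1 - \frac{1}{2\sigma}, \qquad c(x(\sigma))^2 = \frac{1}{\sigma^2}, \qquad f(x(\sigma)) = 1 - \frac{1}{\sigma}.
\end{aligned}
$$

As $\sigma$ increases, $\phi$ and $f$ are non-decreasing and $c^2$ is non-increasing, in agreement with cases (i) to (iii), and $\phi(x(\sigma), \sigma) \le f^* = 1$ as in (12.1.7).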

It is also possible to prove similar results when local minima are computed (with
different assumptions on the problem), and also to get asymptotic estimates of the
rate of convergence. In doing this the vector

$$\lambda^{(k)} = -\sigma^{(k)} c^{(k)} \tag{12.1.8}$$

is defined, and can be regarded as a Lagrange multiplier estimate by virtue of
(12.1.9). The notation $h^{(k)} = x^{(k)} - x^*$, $a_i = \nabla c_i$, and $g = \nabla f$, etc., as in Chapter 9
is used.


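Theorem 12.1.2


If $\sigma^{(k)} \to \infty$ and $x^{(k)} \to x^*$ is any accumulation point, and if rank $A^* = m$, then $x^*$ is
a KT point and it follows that

$$\lambda^{(k)} = \lambda^* + o(1) \tag{12.1.9}$$

$$c^{(k)} = -\lambda^*/\sigma^{(k)} + o(1/\sigma) \tag{12.1.10}$$

$$\sigma^{(k)} c^{(k)T} c^{(k)} = \lambda^{*T} \lambda^*/\sigma^{(k)} + o(1/\sigma). \tag{12.1.11}$$

Furthermore if second order sufficient conditions (9.3.5) hold at $x^*$, $\lambda^*$ then

$$f^* = \phi^* = \phi^{(k)} + \tfrac12 \sigma^{(k)} c^{(k)T} c^{(k)} + o(1/\sigma) \tag{12.1.12}$$

$$h^{(k)} = -T^* \lambda^*/\sigma^{(k)} + o(1/\sigma) \tag{12.1.13}$$

where $T^*$ is defined by

$$\begin{bmatrix} W^* & -A^* \\ -A^{*T} & 0 \end{bmatrix}^{-1} = \begin{bmatrix} H^* & -T^* \\ -T^{*T} & U^* \end{bmatrix}. \tag{12.1.14}$$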
Proof
The fact that $x^{(k)}$ minimizes $\phi(x, \sigma^{(k)})$ in (12.1.4) implies that

$$\nabla \phi(x^{(k)}, \sigma^{(k)}) = g^{(k)} + \sigma^{(k)} A^{(k)} c^{(k)} = 0 \tag{12.1.15}$$

and hence from (12.1.8) that

$$g^{(k)} = A^{(k)} \lambda^{(k)}. \tag{12.1.16}$$

Since rank $A^* = m$ it follows that $A^{(k)+}$ exists and is bounded for all $k$ sufficiently
large and hence that

$$\lambda^{(k)} = A^{(k)+} g^{(k)} = \lambda^* + o(1) \tag{12.1.17}$$

where $\lambda^*$ is defined by $\lambda^* = A^{*+} g^*$. It also follows from (12.1.16) by continuity
that $g^* = A^* \lambda^*$, and from (12.1.17) and (12.1.8) that $c^{(k)} = -\lambda^*/\sigma^{(k)} + o(1/\sigma)$ so
that $c^* = 0$ in the limit $\sigma^{(k)} \to \infty$. Thus $x^*$ satisfies KT conditions (Section 9.1) and
(12.1.9) and (12.1.10) are established. Equation (12.1.11) follows directly and
shows that $\lim \phi^{(k)} \triangleq \phi^* = f^*$. Without assuming second order conditions it is also
possible to show from a Taylor series for $f(x)$ about $x^{(k)}$ and (12.1.16) that

$$
\begin{aligned}
f^* &= f^{(k)} - h^{(k)T} g^{(k)} + o(\|h^{(k)}\|) \\
    &= f^{(k)} - h^{(k)T} A^{(k)} \lambda^{(k)} + o(\|h^{(k)}\|) \\
    &= f^{(k)} - c^{(k)T} \lambda^{(k)} + o(\|h^{(k)}\|)
\end{aligned}
$$

using a similar Taylor series for $c(x)$. Thus

$$\phi^* = \phi^{(k)} + \tfrac12 \sigma^{(k)} c^{(k)T} c^{(k)} + o(\|h^{(k)}\|) \tag{12.1.18}$$

from (12.1.4) and (12.1.8). Second order sufficient conditions for the equality
constraint problem (9.3.5) and rank $A^* = m$ imply that the Lagrangian matrix is
nonsingular at $x^*$, $\lambda^*$ so that the inverse in (12.1.14) exists (see Question 12.4). An
expansion of the Lagrangian function about $x^*$, $\lambda^*$ and using (9.1.7) and (12.1.16)
gives

$$\begin{pmatrix} 0 \\ -c^{(k)} \end{pmatrix} = \begin{bmatrix} W^* & -A^* \\ -A^{*T} & 0 \end{bmatrix} \begin{pmatrix} h^{(k)} \\ \lambda^{(k)} - \lambda^* \end{pmatrix} + o(\max(\|h^{(k)}\|, \|\lambda^{(k)} - \lambda^*\|)).$$

It follows from (12.1.10) and the existence of (12.1.14) that $h^{(k)} = O(1/\sigma)$ which
extends (12.1.18) to give (12.1.12). Multiplying on the left by (12.1.14) and using
(12.1.8) then gives (12.1.13). □


The convergence of (12.1.9) and (12.1.10) can be observed easily in Table
12.1.1 and also the convergence of $\phi^{(k)} \to f^* = -\sqrt{2}$ which (12.1.11) implies. The
other results of theorem 12.1.2 can be used in a more sophisticated algorithm.
Equation (12.1.12) gives a $o(1/\sigma)$ estimate of $f^*$ which is better than the $O(1/\sigma)$
estimate given by $\phi^{(k)}$ itself. The asymptotic form of $h^{(k)}$ in (12.1.13) can be used
in an extrapolation scheme to estimate $x^*$ (as given by Fiacco and McCormick
(1968) in the context of barrier functions). These estimates can be used to terminate
the penalty function iteration and also to provide better initial approximations
when minimizing $\phi(x, \sigma^{(k+1)})$. It is also interesting to observe that the rank
assumption on $A^*$ cannot easily be relaxed. For example if (12.1.1) has no feasible point
then $c^* \ne 0$ must occur and so as $\sigma \to \infty$ it is necessary from (12.1.8) and (12.1.15)
that $A^{(k)} c^{(k)} \to 0$, that is $A^*$ has dependent columns.

This well-developed theoretical background may make it appear that, apart from
the inefficiency of sequential minimization, the method is a robust one which can
be used with confidence. In fact this is not true at all and there are severe numerical
difficulties which arise when the method is used in practice. These are caused by
the fact that as $\sigma^{(k)} \to \infty$, it is increasingly difficult to solve the minimization
problem in (12.1.4), step (ii). To illustrate this behaviour, the contours of $\phi(x, \sigma)$ in
(12.1.3) for the problem (12.1.5) are shown in Figure 12.1.2 for increasing values
of $\sigma$. For $\sigma = 100$, it can be seen that whilst the solution $x(100)$ is well determined
in a radial direction, this is not so tangential to the constraint boundary, so that the
exact location of $x(100)$ is very difficult to determine numerically. This is expressed
mathematically by the fact that (for $0 < m < n$) the Hessian matrix $\nabla^2 \phi(x^{(k)}, \sigma^{(k)})$
becomes increasingly ill conditioned as $\sigma^{(k)} \to \infty$. This result follows by virtue of

$$\nabla^2 \phi(x^{(k)}, \sigma^{(k)}) = W^{(k)} + \sigma^{(k)} A^{(k)} A^{(k)T} \tag{12.1.19}$$

where (12.1.8) is used in the definition

$$W^{(k)} = \nabla^2 \mathcal{L}(x^{(k)}, \lambda^{(k)}). \tag{12.1.20}$$

The $\sigma A A^T$ term in (12.1.19) has rank $m$ and so there are $m$ eigenvalues of $\nabla^2 \phi$
which approach $\infty$ as $\sigma^{(k)} \to \infty$. That the remaining eigenvalues are bounded is a
consequence (see Lancaster, 1969) of the Courant-Fischer theorem. Thus the
condition number of $\nabla^2 \phi$ approaches $\infty$. In practice this shows up in that large values
of $\nabla \phi$ are obtained whilst the minimization routine is unable to make progress in
reducing $\phi$.
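The growth of the condition number can be made concrete for problem (12.1.5). The following worked illustration specializes (12.1.19) to that problem; the final estimates use the limiting values $c \approx -\lambda^*/\sigma$ and $x^T x \approx 1$ from theorem 12.1.2 and are therefore asymptotic approximations rather than exact values.

$$
\begin{aligned}
\nabla^2 \phi(x, \sigma) &= -2\sigma c(x)\, I + 4\sigma\, x x^T, \qquad c(x) = 1 - x^T x, \\
\text{eigenvalues:} \quad & -2\sigma c \;\; (\text{eigenvector} \perp x), \qquad -2\sigma c + 4\sigma\, x^T x \;\; (\text{eigenvector } x), \\
\text{at } x(\sigma): \quad & -2\sigma c \approx 2\lambda^* = \sqrt{2}, \qquad -2\sigma c + 4\sigma\, x^T x \approx \sqrt{2} + 4\sigma, \\
\text{condition number} &\approx \frac{\sqrt{2} + 4\sigma}{\sqrt{2}} = 1 + 2\sqrt{2}\,\sigma.
\end{aligned}
$$

The bounded eigenvalue belongs to the direction tangential to the constraint boundary and the unbounded one to the radial direction, which matches the behaviour described for Figure 12.1.2.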

These remarks have implications for the choice of the sequence $\{\sigma^{(k)}\}$. Choosing
a very large $\sigma^{(1)}$, or increasing $\sigma$ very rapidly in the sequence, gives minimization
problems which are very difficult to solve accurately. The alternative of choosing
$\sigma^{(1)}$ small and increasing $\sigma$ slowly, keeps $x^{(k)}$ close to the minimizer of $\phi(x, \sigma^{(k+1)})$
and makes it easier to get accurate solutions, but is very inefficient. The typical
sequence in (12.1.4), step (i), is a trade-off between these two effects. Probably
$\sigma^{(1)}$ should be chosen to balance the $f$ and $\tfrac12 \sigma c^T c$ terms in (12.1.3) or to minimize
the magnitude of $\nabla \phi$. This discussion also highlights the fact that it may not be
possible to make good use of an accurate estimate of $x^*$ in algorithm (12.1.4) since
$x^{(1)}$ will be remote from $x^*$. In fact many users do not carry out the sequential
technique at all, but carry out a shortcut method in which they minimize (12.1.3)
for a single largish value of $\sigma$. In doing this they must be prepared to accept errors
in the extent to which first order conditions are satisfied. Such users would be well
advised to observe the constraint errors, and to estimate both the error in objective
function using (12.1.12) and the error in the KT condition $g^{(k)} = A^{(k)} \lambda^{(k)}$, in order
to decide whether these errors are acceptable. If not, then it is quite easy to go on
to use a multiplier penalty function as in Section 12.2. However, if derivatives and
software permit, a Lagrangian method (Section 12.3) is likely to prove to be more
efficient in the long run.

Figure 12.1.2 Increasing ill-conditioning in penalty functions (contours of
$\phi - \phi_{\min}$ in powers of 2; the final panel is for $\sigma = 100$)
As mentioned earlier, algorithm (12.1.4) and theorem 12.1.1 are idealized in
that they assume exact minimization of the penalty function. In fact it is
straightforward to give the same results as in theorem 12.1.2 when approximate
minimization is allowed. To do this let the test for terminating the minimization of
$\phi(x, \sigma^{(k)})$ with an approximate minimizer $x^{(k)}$ be

$$\|\nabla \phi(x^{(k)}, \sigma^{(k)})\| \le \nu \|c^{(k)}\|, \tag{12.1.21}$$

where $\nu > 0$ is pre-set and where $c^{(k)} = c(x^{(k)})$ etc. With the same assumptions as
in theorem 12.1.2, and using the fact that there exists a constant $a > 0$ such that

$$\|A^{(k)} c\| \ge a \|c\| \tag{12.1.22}$$

for $k$ sufficiently large (see Question 12.5), it follows from (12.1.21) and (12.1.15)
that

$$\nu \|c^{(k)}\| \ge \|g^{(k)} + \sigma^{(k)} A^{(k)} c^{(k)}\| \ge \sigma^{(k)} a \|c^{(k)}\| - \|g^{(k)}\|.$$

This shows that $c^{(k)} \to 0$. Then (12.1.21) yields

$$\lambda^{(k)} = A^{(k)+} g^{(k)} + o(1)$$

and the rest of the results in theorem 12.1.2 follow as before. It is interesting to
relate (12.1.21) to the practical difficulties caused by ill-conditioning, in that
(12.1.21) definitely limits the extent to which large gradients can be accepted at an
approximate minimizer of $\phi(x, \sigma)$.

A penalty function for the inequality constraint problem (12.1.2) can be given
in an analogous way to (12.1.3) by

$$\phi(x, \sigma) = f(x) + \tfrac12 \sigma \sum_i [\min(c_i(x), 0)]^2. \tag{12.1.23}$$

An illustration is provided by the trivial problem: min $x$ subject to $x \ge 1$. Then
Figure 12.1.1 also illustrates the graph of $\phi$, except that $\phi$ and $f$ are equal for $x \ge 1$.
Likewise if the constraint in (12.1.5) is replaced by $1 - x_1^2 - x_2^2 \ge 0$ then Figure
12.1.2 illustrates contours of $\phi$, except that inside the unit circle the linear
contours of $-x_1 - x_2$ must be drawn. These figures illustrate the jump discontinuity in
the second derivative of (12.1.23) when $c = 0$ (for example at $x^*$). They also
illustrate that $x^{(k)}$ approaches $x^*$ from the infeasible side of the inequality constraints,
which leads to the term exterior penalty function for (12.1.23). Exactly similar
results to those in theorems 12.1.1 and 12.1.2 follow on replacing $c_i^{(k)}$ by
$\min(c_i^{(k)}, 0)$.

Another class of sequential minimization techniques is available to solve the
inequality constraint problem (12.1.2), known as barrier function methods. These
are characterized by their property of preserving strict constraint feasibility at all
times, by using a barrier term which is infinite on the constraint boundaries. This
can be advantageous if the objective function is not defined when the constraints
are violated. The sequence of minimizers is also feasible, therefore, and hence the
techniques are sometimes referred to as interior point methods. The two most
important cases are the inverse barrier function

$$\phi(x, r) = f(x) + r \sum_i [c_i(x)]^{-1} \tag{12.1.24}$$

(Carroll, 1961) and the logarithmic barrier function

$$\phi(x, r) = f(x) - r \sum_i \log(c_i(x)) \tag{12.1.25}$$

(Frisch, 1955). As with $\sigma$ in (12.1.3) the coefficient $r$ is used to control the barrier
function iteration. In this case a sequence $\{r^{(k)}\} \to 0$ is chosen, which ensures that
the barrier term becomes more and more negligible except close to the boundary.
Also $x(r^{(k)})$ is defined as the minimizer of $\phi(x, r^{(k)})$. Otherwise the procedure is
the same as that in (12.1.4). Typical graphs of (12.1.24) for a sequence of values
of $r^{(k)}$ are illustrated in Figure 12.1.3, and it can be seen that $x(r^{(k)}) \to x^*$ as
$r^{(k)} \to 0$. This behaviour can be established rigorously in a similar way to theorem
12.1.1 (Osborne, 1972). Other features such as estimates of Lagrange multipliers
and the asymptotic behaviour of $h^{(k)}$ also follow, which can be used to determine
a suitable sequence $\{r^{(k)}\}$ and to allow the use of extrapolation techniques (see
Questions 12.6 and 12.7).
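For the trivial problem used earlier (min $x$ subject to $x \ge 1$, so $c(x) = x - 1$ and $x^* = 1$) the barrier minimizers can be found in closed form. This is a worked illustration only, obtained by setting the derivative of (12.1.24) and of (12.1.25) to zero on $x > 1$.

$$
\begin{aligned}
\text{inverse barrier:} \quad \phi(x, r) &= x + \frac{r}{x - 1}, & \phi'(x, r) &= 1 - \frac{r}{(x - 1)^2} = 0 \;\Rightarrow\; x(r) = 1 + \sqrt{r}, \\
\text{logarithmic barrier:} \quad \phi(x, r) &= x - r \log(x - 1), & \phi'(x, r) &= 1 - \frac{r}{x - 1} = 0 \;\Rightarrow\; x(r) = 1 + r.
\end{aligned}
$$

In both cases $x(r)$ is strictly feasible and $x(r) \to x^* = 1$ as $r \to 0$, approaching the solution from the interior in contrast to the exterior penalty function (12.1.23).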

Unfortunately barrier function algorithms suffer from the same numerical
difficulties in the limit as the penalty function algorithm does, in particular the badly
determined nature of $x(r^{(k)})$ tangential to the constraint surface, and the difficulty
of locating the minimizer due to ill-conditioning and large gradients. Moreover there
are additional problems which arise. The barrier function is undefined for infeasible
points, and the simple expedient of setting it to infinity can make the line search
inefficient. Also the singularity makes conventional quadratic or cubic
interpolations in the line search work less efficiently. Thus special purpose line searches
are required (Fletcher and McCann, 1969), and the aim of a simple-to-use algorithm
is lost. Another difficulty is that an initial interior feasible point is required, and
this in itself is a non-trivial problem involving the strict solution of a set of
inequalities. In view of these difficulties and the general inefficiency of sequential
techniques, barrier functions currently attract little interest.

Figure 12.1.3 Increasing ill-conditioning in the barrier function (12.1.24)


12.2 Multiplier Penalty Functions 


The way in which the penalty function (12.1.3) is used to solve (12.1.1) can be
envisaged as an attempt to create a local minimizer at $x^*$ in the limit $\sigma^{(k)} \to \infty$
(see Figures 12.1.1 and 12.1.2). However $x^*$ can be made to minimize $\phi$ for finite
$\sigma$ by changing the origin of the penalty term. This suggests using the function

$$\phi(x, \theta, \sigma) = f(x) + \tfrac12 \sum_i \sigma_i (c_i(x) - \theta_i)^2 = f(x) + \tfrac12 (c(x) - \theta)^T S (c(x) - \theta) \tag{12.2.1}$$

(Powell, 1969), where $\theta, \sigma \in \mathbb{R}^m$ and $S = \operatorname{diag} \sigma_i$. The parameters $\theta_i$ correspond to
shifts of origin and the $\sigma_i \ge 0$ control the size of the penalty, like $\sigma$ in (12.1.3). For
the trivial problem: min $x$ subject to $x = 1$ again, if the correct shift $\theta$ is chosen
(which depends on $\sigma$), then it can be observed in Figure 12.2.1 that $x^*$ minimizes
$\phi(x, \theta, \sigma)$. This suggests an algorithm which attempts to locate the optimum value
of $\theta$ whilst keeping $\sigma$ finite and so avoids the ill-conditioning in the limit $\sigma \to \infty$.
This is also illustrated for problem (12.1.5) with $\sigma = 1$, in which the optimum shift
is $\theta = 1/\sqrt{2}$. The resulting contours are shown in Figure 12.2.2 and the contrast
with Figure 12.1.2 as $\sigma^{(k)} \to \infty$ can be seen.





Figure 12.2.1 Multiplier penalty functions

Figure 12.2.2 A multiplier penalty function
($\sigma = 1$, $\theta = 1/\sqrt{2}$) (contours of $\phi - \phi_{\min}$ in powers of 2)


In fact it is more convenient to define

$$\lambda_i = \theta_i \sigma_i, \quad i = 1, 2, \ldots, m \tag{12.2.2}$$

and ignore the term $\tfrac12 \sum_i \sigma_i \theta_i^2$ (independent of $x$) giving

$$\phi(x, \lambda, \sigma) = f(x) - \lambda^T c(x) + \tfrac12 c(x)^T S c(x) \tag{12.2.3}$$

(Hestenes, 1969). There exists a corresponding optimum value of $\lambda$ for which $x^*$
minimizes $\phi(x, \lambda, \sigma)$, which in fact is the Lagrange multiplier $\lambda^*$ at the solution to
(12.1.1). This result is now true independent of $\sigma$, so it is usually convenient to
ignore the dependence on $\sigma$ and write $\phi(x, \lambda)$ in (12.2.3), and to use $\lambda$ as the
control parameter in a sequential minimization algorithm as follows.

(i) Determine a sequence $\{\lambda^{(k)}\} \to \lambda^*$.
(ii) For each $\lambda^{(k)}$ find a local minimizer, $x(\lambda^{(k)})$ say, to
     $\min_x \phi(x, \lambda)$.                                             (12.2.4)
(iii) Terminate when $c(x(\lambda^{(k)}))$ is sufficiently small.

The main difference between this algorithm and (12.1.4) is that $\lambda^*$ is not known in
advance so the sequence in step (i) cannot be predetermined. However it is shown
below how such a sequence can be constructed. Because (12.2.3) is obtained from
(12.1.3) by adding a multiplier term $-\lambda^T c$, (12.2.3) is often referred to as a
multiplier penalty function. Alternatively (12.2.3) is the Lagrangian function (9.1.6) in
which the objective $f$ is augmented by the term $\tfrac12 c^T S c$. Hence the term augmented
Lagrangian function is also used to describe (12.2.3).
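The effect of the multiplier term can be seen on the trivial problem min $x$ subject to $c(x) = x - 1 = 0$, for which $x^* = 1$ and $\lambda^* = 1$. The following is a worked illustration only, obtained by differentiating (12.2.3) with a single penalty parameter $\sigma > 0$.

$$
\begin{aligned}
\phi(x, \lambda, \sigma) &= x - \lambda (x - 1) + \tfrac12 \sigma (x - 1)^2, \\
\phi'(x, \lambda, \sigma) &= 1 - \lambda + \sigma (x - 1) = 0 \;\Rightarrow\; x(\lambda) = 1 - \frac{1 - \lambda}{\sigma}, \qquad c(x(\lambda)) = -\frac{1 - \lambda}{\sigma}.
\end{aligned}
$$

With $\lambda = \lambda^* = 1$ the minimizer is $x(\lambda^*) = 1 = x^*$ for every $\sigma > 0$, whereas $\lambda = 0$ recovers the Courant minimizer $x = 1 - 1/\sigma$. In terms of (12.2.2) the corresponding shift is $\theta = \lambda^*/\sigma = 1/\sigma$, which depends on $\sigma$.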

The result that $\lambda^*$ is the optimum choice of the control parameter vector in
(12.2.3) is expressed in the following.

Theorem 12.2.1


If second order sufficient conditions hold at $x^*$, $\lambda^*$ then there exists $\sigma' \ge 0$ such
that for any $\sigma > \sigma'$, $x^*$ is an isolated local minimizer of $\phi(x, \lambda^*, \sigma)$, that is
$x^* = x(\lambda^*)$.


Proof
Differentiating (12.2.3) gives

$$\nabla \phi(x, \lambda^*, \sigma) = g - A \lambda^* + A S c. \tag{12.2.5}$$

The second order conditions require $x^*$, $\lambda^*$ to be a KT point, that is $g^* = A^* \lambda^*$ and
$c^* = 0$, so it follows that $\nabla \phi(x^*, \lambda^*, \sigma) = 0$. Differentiating (12.2.5) gives

$$\nabla^2 \phi(x, \lambda^*, \sigma) = W + A S A^T \triangleq W_\sigma, \tag{12.2.6}$$

say, where $W = \nabla^2 f - \sum_i (\lambda_i^* - \sigma_i c_i) \nabla^2 c_i$, and hence

$$W_\sigma^* \triangleq \nabla^2 \phi(x^*, \lambda^*, \sigma) = W^* + A^* S A^{*T}. \tag{12.2.7}$$

Let rank $A^* = r \le m$ and let $B$ ($n \times r$) be an orthonormal basis matrix ($B^T B = I$)
for $A^*$, so that $A^* = BC$ where $C$ ($= B^T A^*$) has rank $r$. Consider any vector $u \ne 0$
and let $u = v + Bw$ where $B^T v = 0 = A^{*T} v$. Then

$$u^T W_\sigma^* u = v^T W^* v + 2 v^T W^* B w + w^T B^T W^* B w + w^T C S C^T w.$$

From (9.3.5) and (9.3.4) there exists a constant $a > 0$ such that $v^T W^* v \ge a \|v\|_2^2$.
Let $b$ be the greatest singular value of $W^* B$ and let $d = \|B^T W^* B\|_2$. Let
$\sigma = \min_i \sigma_i > 0$ and let $\mu > 0$ be the smallest eigenvalue of $C C^T$. Then

$$u^T W_\sigma^* u \ge a \|v\|_2^2 - 2b \|v\|_2 \|w\|_2 + (\sigma \mu - d) \|w\|_2^2.$$

Since $\|v\| = \|w\| = 0$ cannot hold, if $\sigma > \sigma'$ where $\sigma' = (d + b^2/a)/\mu$ then it follows
that $u^T W_\sigma^* u > 0$. Thus both $\nabla \phi(x^*, \lambda^*, \sigma) = 0$ and $\nabla^2 \phi(x^*, \lambda^*, \sigma)$ is positive
definite so $x^*$ is an isolated local minimizer of $\phi(x, \lambda^*, \sigma)$ for $\sigma$ sufficiently
large. □


The assumption of second order conditions is important and not easily relaxed.
For example if $f = x_1^4 + x_1 x_2$ and $c = x_2$ then $x^* = 0$ solves (12.1.1) with the
unique multiplier $\lambda^* = 0$. However second order conditions are not satisfied, and
in fact $x^* = 0$ does not minimize $\phi(x, 0, \sigma) = x_1^4 + x_1 x_2 + \tfrac12 \sigma x_2^2$ for any value of
$\sigma$. Henceforth it is assumed that second order sufficient conditions hold and $\sigma$ is
sufficiently large.

The minimizer $x(\lambda)$ of $\phi(x, \lambda)$ can also be regarded as having been determined
by solving the non-linear equations

$$\nabla \phi(x, \lambda) = 0. \tag{12.2.8}$$

Because $\nabla^2 \phi(x^*, \lambda^*)$ is positive definite and is the Jacobian matrix of this system,
it follows from the implicit function theorem (Apostol, 1957) that there exist open
neighbourhoods $\Omega_\lambda \subset \mathbb{R}^m$ about $\lambda^*$ and $\Omega_x \subset \mathbb{R}^n$ about $x^*$ and a $\mathbb{C}^1$ function
$x(\lambda)$ ($\Omega_\lambda \to \Omega_x$) such that $\nabla \phi(x(\lambda), \lambda) = 0$. Also $\nabla^2 \phi(x, \lambda)$ is positive definite for
all $x \in \Omega_x$ and $\lambda \in \Omega_\lambda$, so $x(\lambda)$ is the minimizer of $\phi(x, \lambda)$. It may be that $\phi(x, \lambda)$
has local minima so that various solutions to (12.2.8) exist: it is assumed that a
consistent choice of $x(\lambda)$ is made by the minimization routine as that solution
which exists in $\Omega_x$. The vector $x(\lambda)$ can also be interpreted in yet another way as
the solution to a neighbouring problem to (12.1.1) (see Question 12.14) and this
can sometimes be convenient.

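It is important to examine the function

$$\psi(\lambda) \triangleq \phi(x(\lambda), \lambda) \qquad (12.2.9)$$

which is derived from $\lambda$ by first finding $x(\lambda)$. By the local optimality of $x(\lambda)$ on
$\Omega_x$, for any $\lambda$ in $\Omega_\lambda$ it follows that

$$\psi(\lambda) = \phi(x(\lambda), \lambda) \le \phi(x^*, \lambda) = \phi(x^*, \lambda^*) = \psi(\lambda^*) \qquad (12.2.10)$$

(using $c^* = 0$). Thus $\lambda^*$ is a local unconstrained maximizer of $\psi(\lambda)$. This result is
also true globally if $x(\lambda)$ is a global minimizer of $\phi(x, \lambda)$. Thus methods for
generating a sequence $\lambda^{(k)} \to \lambda^*$ can be derived by applying unconstrained minimization
methods to $-\psi(\lambda)$. To do this requires formulae for the derivatives $\nabla\psi$ and $\nabla^2\psi$ of
$\psi$ with respect to $\lambda$. Using the matrix notation $[\partial x/\partial\lambda]_{ij}$ to denote $\partial x_i/\partial\lambda_j$, it
follows from the chain rule that

$$[d\psi/d\lambda] = [\partial\phi/\partial x][\partial x/\partial\lambda] + [\partial\phi/\partial\lambda],$$

using the total derivative to denote variations through both $x(\lambda)$ and $\lambda$. Since
$[\partial\phi/\partial x] = 0$ from (12.2.8) and $\partial\phi/\partial\lambda_i = -c_i$ from (12.2.3) it follows that

$$\nabla\psi(\lambda) = -c(x(\lambda)). \qquad (12.2.11)$$

Also by the chain rule

$$[dc/d\lambda] = [\partial c/\partial x][\partial x/\partial\lambda] = A^T[\partial x/\partial\lambda].$$

Operating on (12.2.8) by $[d/d\lambda]$ gives

$$[d\nabla\phi(x(\lambda), \lambda)/d\lambda] = [\partial\nabla\phi/\partial x][\partial x/\partial\lambda] + [\partial\nabla\phi/\partial\lambda] = 0.$$

But $[\partial\nabla\phi/\partial x] = \nabla^2\phi(x(\lambda), \lambda) = W_\sigma$ and $[\partial\nabla\phi/\partial\lambda] = -A$ so it follows that

$$\nabla^2\psi(\lambda) = -[dc/d\lambda] = -A^T W_\sigma^{-1} A \,\big|_{x(\lambda)}. \qquad (12.2.12)$$

Since $c(x(\lambda^*)) = c(x^*) = 0$ and $W_\sigma^*$ is positive definite it follows that $\nabla\psi(\lambda^*) = 0$
and (when rank $A^* = m$) that $\nabla^2\psi(\lambda^*)$ is negative definite, which reinforces the
maximization result in (12.2.10).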
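The derivative formulae (12.2.11) and (12.2.12) can be checked by finite differences on a small problem. The sketch below assumes the test problem: minimize $-x_1 - x_2$ subject to $c(x) = 1 - x_1^2 - x_2^2 = 0$, with a single multiplier and a single $\sigma$; this choice of problem, the bare Newton inner iteration, and all function names are illustrative assumptions rather than anything prescribed by the text.

```python
def constraint(x):
    return 1.0 - x[0] ** 2 - x[1] ** 2


def minimize_phi(lam, sigma, x=(0.7, 0.7), iters=100):
    """Unsafeguarded Newton iteration for the stationary point of phi(x, lam, sigma)."""
    x1, x2 = x
    for _ in range(iters):
        mu = lam - sigma * (1.0 - x1 * x1 - x2 * x2)
        g1, g2 = -1.0 + 2.0 * mu * x1, -1.0 + 2.0 * mu * x2  # gradient as in (12.2.5)
        h11 = 2.0 * mu + 4.0 * sigma * x1 * x1               # Hessian W + sigma*a*a^T
        h12 = 4.0 * sigma * x1 * x2
        h22 = 2.0 * mu + 4.0 * sigma * x2 * x2
        det = h11 * h22 - h12 * h12
        d1 = -(h22 * g1 - h12 * g2) / det
        d2 = -(h11 * g2 - h12 * g1) / det
        x1, x2 = x1 + d1, x2 + d2
        if abs(d1) + abs(d2) < 1e-14:
            break
    return x1, x2


def psi(lam, sigma):
    """psi(lam) = phi(x(lam), lam) as in (12.2.9)."""
    x = minimize_phi(lam, sigma)
    c = constraint(x)
    return -x[0] - x[1] - lam * c + 0.5 * sigma * c * c


lam, sigma, h = 0.5, 1.0, 1e-5
x = minimize_phi(lam, sigma)
c = constraint(x)

# (12.2.11): d(psi)/d(lam) should equal -c(x(lam))
fd_gradient = (psi(lam + h, sigma) - psi(lam - h, sigma)) / (2.0 * h)

# (12.2.12): here a = -2x is an eigenvector of W_sigma, so
# a^T W_sigma^{-1} a = 4r / (2*mu + 4*sigma*r) with r = x^T x
r = x[0] ** 2 + x[1] ** 2
mu = lam - sigma * c
hessian_formula = -4.0 * r / (2.0 * mu + 4.0 * sigma * r)
fd_hessian = -(constraint(minimize_phi(lam + h, sigma))
               - constraint(minimize_phi(lam - h, sigma))) / (2.0 * h)

print(fd_gradient, -c)
print(fd_hessian, hessian_formula)
```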

The most obvious sequence $\{\lambda^{(k)}\}$ to be used in step (i) of algorithm (12.2.4) is
obtained by using Newton's method from an initial estimate $\lambda^{(1)}$ giving the
iteration

$$\lambda^{(k+1)} = \lambda^{(k)} - (A^T W_\sigma^{-1} A)^{-1} c \,\big|_{x(\lambda^{(k)})}. \qquad (12.2.13)$$

This method requires $W_\sigma$ and hence explicit formulae for second derivatives, which
is disadvantageous. However when only first derivatives are available and a
quasi-Newton method is used to find $x(\lambda^{(k)})$, then the resulting $H$ matrix (see Section
3.2) is a very good approximation to $W_\sigma^{-1}$. Using this matrix in (12.2.13) (Fletcher,
1975) enables the advantages of Newton's method to be obtained whilst only
requiring first derivatives. A different method is suggested by Powell (1969) and
by Hestenes (1969), and is best motivated by the fact that for large $\sigma$,

$$(A^T W_\sigma^{-1} A)^{-1} \approx S \qquad (12.2.14)$$

(see Question 12.12). When this approximation is used in (12.2.13) the iteration

$$\lambda^{(k+1)} = \lambda^{(k)} - S c^{(k)} \qquad (12.2.15)$$

is obtained. No derivatives are required by this formula so it is particularly
convenient when the routine which minimizes $\phi(x, \lambda)$ does not calculate or estimate
derivatives. Furthermore by making $S$ sufficiently large, an arbitrarily fast rate of
linear convergence of $\lambda^{(k)}$ to $\lambda^*$ can be obtained (see Question 12.13).
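The two multiplier updates (12.2.13) and (12.2.15) can be compared on a small example. This is a sketch under stated assumptions: the test problem is taken to be minimize $-x_1 - x_2$ subject to $1 - x_1^2 - x_2^2 = 0$, the inner minimization is a bare Newton iteration, and the names `minimize_phi` and `multiplier_iteration` are hypothetical. For this problem the asymptotic Powell-Hestenes ratio works out to $\sqrt2/(\sqrt2 + 4\sigma)$.

```python
def constraint(x):
    return 1.0 - x[0] ** 2 - x[1] ** 2


def minimize_phi(lam, sigma, x=(0.7, 0.7), iters=100):
    """Unsafeguarded Newton iteration for the stationary point of phi(x, lam, sigma)."""
    x1, x2 = x
    for _ in range(iters):
        mu = lam - sigma * (1.0 - x1 * x1 - x2 * x2)
        g1, g2 = -1.0 + 2.0 * mu * x1, -1.0 + 2.0 * mu * x2
        h11 = 2.0 * mu + 4.0 * sigma * x1 * x1
        h12 = 4.0 * sigma * x1 * x2
        h22 = 2.0 * mu + 4.0 * sigma * x2 * x2
        det = h11 * h22 - h12 * h12
        d1 = -(h22 * g1 - h12 * g2) / det
        d2 = -(h11 * g2 - h12 * g1) / det
        x1, x2 = x1 + d1, x2 + d2
        if abs(d1) + abs(d2) < 1e-14:
            break
    return x1, x2


def multiplier_iteration(sigma, newton, lam=0.0, steps=10):
    """Return the list of (lam, c) pairs generated by (12.2.13) or (12.2.15)."""
    history, x = [], (0.7, 0.7)
    for _ in range(steps):
        x = minimize_phi(lam, sigma, x)
        c = constraint(x)
        history.append((lam, c))
        if newton:
            # a^T W_sigma^{-1} a in closed form for this problem (a = -2x)
            r = x[0] ** 2 + x[1] ** 2
            mu = lam - sigma * c
            lam -= c * (2.0 * mu + 4.0 * sigma * r) / (4.0 * r)   # (12.2.13)
        else:
            lam -= sigma * c                                      # (12.2.15)
    return history


for label, sigma, newton in (("PH, sigma=1", 1.0, False),
                             ("PH, sigma=10", 10.0, False),
                             ("Newton, sigma=1", 1.0, True)):
    print(label)
    for lam, c in multiplier_iteration(sigma, newton, steps=6):
        print("   lam = %.8f   c = % .3e" % (lam, c))
```

The printed constraint residuals shrink by a roughly constant factor per step for the Powell-Hestenes update, and much faster for the Newton update.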

An illustration of the methods based on using (12.2.13) or (12.2.15), applied to
solve the problem (12.1.5), is given in Table 12.2.1, starting from $\lambda^{(1)} = 0$. It can
be observed that $\lambda^{(k)} \to \lambda^*$ and $c^{(k)} \to 0$ in all cases. The Powell-Hestenes method
exhibits linear convergence at a rate 0.26 when $\sigma = 1$ and 0.034 when $\sigma = 10$. This
illustrates the fact that increasing $\sigma_i$ tenfold asymptotically reduces the
convergence ratio of $c_i$ to about one-tenth of its previous value. Newton's method is
seen to converge at an approximately quadratic rate and is a little better than the
Powell-Hestenes method with $\sigma = 10$.

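Although these methods have good local convergence properties, they must be
supplemented in a general algorithm to ensure global convergence. This can be done
by an algorithm due to Powell (1969).

  (i)   Initially set $\lambda = \lambda^{(1)}$, $\sigma = \sigma^{(1)}$, $k = 0$, $\|c^{(0)}\|_\infty = \infty$.
  (ii)  Find the minimizer $x(\lambda, \sigma)$ of $\phi(x, \lambda, \sigma)$ and denote $c = c(x(\lambda, \sigma))$.
  (iii) If $\|c\|_\infty > \tfrac14\|c^{(k)}\|_\infty$ set $\sigma_i = 10\sigma_i$ $\forall i: |c_i| > \tfrac14\|c^{(k)}\|_\infty$
        and go to (ii).                                                    (12.2.16)
  (iv)  Set $k = k + 1$, $\lambda^{(k)} = \lambda$, $\sigma^{(k)} = \sigma$, $c^{(k)} = c$.
  (v)   Set $\lambda = \lambda^{(k)} - S^{(k)} c^{(k)}$ and go to (ii).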
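The steps of (12.2.16) can be sketched for a problem with one constraint as follows. The test problem (minimize $-x_1 - x_2$ subject to $1 - x_1^2 - x_2^2 = 0$), the stopping tolerance `tol`, the cap on the number of minimizations, and the bare Newton inner solver are all assumptions made for illustration; the algorithm as stated has no termination test.

```python
import math


def constraint(x):
    return 1.0 - x[0] ** 2 - x[1] ** 2


def minimize_phi(lam, sigma, x, iters=100):
    """Unsafeguarded Newton iteration for the stationary point of phi(x, lam, sigma)."""
    x1, x2 = x
    for _ in range(iters):
        mu = lam - sigma * (1.0 - x1 * x1 - x2 * x2)
        g1, g2 = -1.0 + 2.0 * mu * x1, -1.0 + 2.0 * mu * x2
        h11 = 2.0 * mu + 4.0 * sigma * x1 * x1
        h12 = 4.0 * sigma * x1 * x2
        h22 = 2.0 * mu + 4.0 * sigma * x2 * x2
        det = h11 * h22 - h12 * h12
        d1 = -(h22 * g1 - h12 * g2) / det
        d2 = -(h11 * g2 - h12 * g1) / det
        x1, x2 = x1 + d1, x2 + d2
        if abs(d1) + abs(d2) < 1e-14:
            break
    return x1, x2


def powell_algorithm(lam=0.0, sigma=1.0, tol=1e-10, max_minimizations=100):
    c_prev = math.inf                       # step (i): ||c^(0)|| = infinity
    x = (0.7, 0.7)
    c = constraint(x)
    for _ in range(max_minimizations):
        x = minimize_phi(lam, sigma, x)     # step (ii)
        c = constraint(x)
        if abs(c) <= tol:                   # added stopping test (an assumption)
            break
        if abs(c) > 0.25 * c_prev:          # step (iii): rate 0.25 not achieved
            sigma *= 10.0
            continue
        c_prev = abs(c)                     # step (iv)
        lam -= sigma * c                    # step (v) using (12.2.15)
    return x, lam, sigma, c


x, lam, sigma, c = powell_algorithm()
print("x =", x, " lam =", lam, " sigma =", sigma, " c =", c)
```

With one constraint the norm $\|c\|_\infty$ is just $|c|$ and $S$ is the scalar $\sigma$; for $m > 1$ step (iii) would increase only those $\sigma_i$ whose $|c_i|$ fails the test.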


The aim of the algorithm is to achieve linear convergence at a rate 0.25 or
better. If any component $c_i$ is not reduced at this rate, the corresponding penalty
$\sigma_i$ is increased tenfold which induces more rapid convergence (see Question 12.12).
A simple proof that $c^{(k)} \to 0$ can be given along the following lines. Clearly this
happens unless the inner iteration ((iii) $\to$ (ii)) fails to terminate. In this iteration
$\lambda$ is fixed, and if for any $i$, $|c_i| > \tfrac14\|c^{(k)}\|_\infty$ occurs infinitely often, then $\sigma_i \to \infty$.
As in theorem 12.1.1 this implies $c_i \to 0$, which contradicts the infinitely often
assumption and proves termination. This convergence result is true whatever
formula is used in step (v) of (12.2.16). It follows as in theorem 12.1.2 that any
limit point is a KT point of (12.1.1) with $x^{(k)} \to x^*$ and $\lambda^{(k)} - S^{(k)} c^{(k)} \to \lambda^*$.
For formulae (12.2.13) and (12.2.15), the required rate of convergence is obtained
when $\sigma$ is sufficiently large, and then the basic iteration takes over in which $\sigma$ stays
constant and only the $\lambda$ parameters are changed.

In practice this proof is not as powerful as it might seem. Unfortunately
increasing $\sigma$ can lead to difficulties caused by ill-conditioning as described in Section 12.1,
in which case accuracy in the solution is lost. Furthermore when no feasible point
exists, this situation is not detected, and $\sigma$ is increased without bound which is an
unsatisfactory situation. I think there is scope for other algorithms to be determined
which keep $\sigma$ fixed at all times and induce convergence by ensuring that $\psi(\lambda)$ is
increased sufficiently at each iteration, for example by using a restricted step
modification of Newton's method (12.2.13), as described in Chapter 5, Volume 1.

There is no difficulty in modifying these penalty functions to handle inequality
constraints. For the inequality constraint problem (12.1.2) a suitable modification
of (12.2.3) (Fletcher, 1975) is

$$\phi(x, \theta, \sigma) = f(x) + \tfrac12\sum_i \sigma_i\,(c_i(x) - \theta_i)_-^2 \qquad (12.2.17)$$

where $a_- = \min(a, 0)$. The effect of this on the problem: min $x$ subject to $x \ge 1$
can also be seen in Figure 12.2.1, the only difference being that for $c \ge \theta$ (that is
$x \ge 1 + \theta$) the graph of $\phi$ is identical with that of $f$. Thus although $\phi(x, \theta, \sigma)$ has
second derivative jump discontinuities at $c_i(x) = \theta_i$, these are usually remote from
the solution and in practice do not appear to affect the performance of the
unconstrained minimization routine adversely. Another example of this is the inequality
constraint version of problem (12.1.5). The surface $c(x) = \theta = 1/\sqrt2$ on which this
discontinuity occurs is illustrated by the dotted circle in Figure 12.2.2, which is
remote from $x^*$. The contours of (12.2.17) differ from those given only within this
dotted circle, where they become the unpenalized linear contours of $f(x)$. As with
(12.2.1) it is possible to rearrange the function, using (12.2.2) and omitting terms
independent of $x$, to give the multiplier penalty function

$$\phi(x, \lambda, \sigma) = f(x) + \sum_i \begin{cases} -\lambda_i c_i + \tfrac12\sigma_i c_i^2 & \text{if } c_i \le \lambda_i/\sigma_i \\ -\tfrac12\lambda_i^2/\sigma_i & \text{if } c_i \ge \lambda_i/\sigma_i \end{cases} \qquad (12.2.18)$$

where $c_i = c_i(x)$. Rockafellar (1974) first suggested this type of function, and as
with (12.2.3) it is the most convenient form for developing the theory.
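The rearrangement from (12.2.17) to (12.2.18) can be verified term by term: with $\theta_i = \lambda_i/\sigma_i$ the two penalty terms differ only by the constant $\tfrac12\lambda_i^2/\sigma_i$, which does not depend on $x$. The following sketch checks this for a single constraint; the function names are illustrative.

```python
def term_theta(c, lam, sigma):
    """Penalty term of (12.2.17) for one constraint, with theta = lam/sigma."""
    theta = lam / sigma
    return 0.5 * sigma * min(c - theta, 0.0) ** 2


def term_lambda(c, lam, sigma):
    """Penalty term of (12.2.18) for one constraint."""
    if c <= lam / sigma:
        return -lam * c + 0.5 * sigma * c * c
    return -0.5 * lam * lam / sigma


lam, sigma = 1.5, 2.0
for c in (-1.0, 0.0, 0.5, 0.75, 2.0):
    difference = term_theta(c, lam, sigma) - term_lambda(c, lam, sigma)
    # the difference is the x-independent constant lam**2 / (2*sigma)
    print(c, difference, 0.5 * lam * lam / sigma)
```

Because the difference is constant, both forms have the same minimizer $x(\lambda)$ and the same derivatives with respect to $x$.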

Most of the theoretical results can be extended to the inequality case. If strict
complementarity holds then the extension of theorem 12.2.1 is immediate, although
the result can also be proved in the absence of this condition (Fletcher, 1975). The
dual function $\psi(\lambda)$ is again defined by (12.2.9) and an analogous global result to
(12.2.10) is

$$\psi(\lambda) = \phi(x(\lambda), \lambda) \le \phi(x^*, \lambda) = f^* + \sum_i \begin{cases} -\lambda_i c_i^* + \tfrac12\sigma_i c_i^{*2} & c_i^* \le \lambda_i/\sigma_i \\ -\tfrac12\lambda_i^2/\sigma_i & c_i^* \ge \lambda_i/\sigma_i \end{cases} \;\le f^* = \phi(x^*, \lambda^*) = \psi(\lambda^*). \qquad (12.2.19)$$

This result is also true locally if strict complementarity holds and can probably be
extended in the absence of this condition, again by following Fletcher (1975).
Derivative expressions analogous to (12.2.11) and (12.2.12) are easily obtained as

$$d\psi(\lambda)/d\lambda_i = -\min(c_i, \theta_i), \qquad i = 1, 2, \ldots, m \qquad (12.2.20)$$

where $c_i = c_i(x(\lambda))$ and $\theta_i = \lambda_i/\sigma_i$, and

$$\nabla^2\psi(\lambda) = \begin{bmatrix} -A^T W_\sigma^{-1} A \,\big|_{x(\lambda)} & 0 \\ 0 & -S^{-1} \end{bmatrix} \qquad (12.2.21)$$

where the columns of $A$ correspond to indices $i: c_i < \theta_i$ and those of $S^{-1}$ to
indices $i: c_i \ge \theta_i$. Algorithms for determining the sequence $\{\lambda^{(k)}\}$ in step (i) of
algorithm (12.2.4) can be determined using these derivative expressions. An
equivalent form of Newton's method (12.2.13) is possible, although in view of the implicit
inequalities $\lambda \ge 0$ it is probably preferable to choose $\lambda^{(k+1)}$ to solve a subproblem

$$\text{maximize}_\lambda \; q^{(k)}(\lambda) \quad \text{subject to } \lambda \ge 0 \qquad (12.2.22)$$

where $q^{(k)}(\lambda)$ is obtained by truncating a Taylor series for $\psi(\lambda)$ about $\lambda^{(k)}$ after
the quadratic term. In fact the simple structure of (12.2.21) enables a simpler
problem to be solved in terms of just the indices $i: c_i < \theta_i$. The result in (12.2.14)
can also be used to determine an extension of (12.2.15), that is

$$\lambda_i^{(k+1)} = \lambda_i^{(k)} - \min(\sigma_i c_i^{(k)}, \lambda_i^{(k)}), \qquad i = 1, 2, \ldots, m. \qquad (12.2.23)$$

These formulae can be incorporated in a globally convergent algorithm like
(12.2.16) by using $\|\nabla\psi\|_\infty$ in place of $\|c\|_\infty$ to monitor the rate of convergence.
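A small sketch of the update (12.2.23) shows that it keeps the multipliers non-negative automatically: when $\sigma_i c_i \ge \lambda_i$ the new multiplier is exactly zero, and otherwise it is $\lambda_i - \sigma_i c_i > 0$. The function name and the sample numbers are illustrative assumptions.

```python
def update_multipliers(lam, sigma, c):
    """One step of (12.2.23) applied componentwise."""
    return [l - min(s * ci, l) for l, s, ci in zip(lam, sigma, c)]


lam = [1.0, 0.5, 0.0]
sigma = [10.0, 10.0, 10.0]
c = [0.02, 0.3, -0.1]     # small slack, large slack, violated constraint
print(update_multipliers(lam, sigma, c))
```

In the example the first multiplier is reduced, the second (a constraint with large positive slack) is set to zero, and the third (a violated constraint) is increased.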

A selection of numerical experiments is given by Fletcher (1975), which seems
to indicate that whilst both the Newton-like formulae and the Powell-Hestenes
formulae for updating $\lambda^{(k)}$ are effective, the Newton-like method is somewhat
more efficient. Local convergence is rapid and high accuracy can usually be achieved
in about four to six minimizations. When this occurs with modest values of $\sigma$ then
no difficulties due to ill-conditioning and loss of accuracy are observed.
Furthermore the Hessian matrix can be carried forward, and updated if $\sigma_i$ is increased, so
the computational effort for the successive minimizations goes down rapidly.

Since $\phi(x, \lambda, \sigma)$ is always well defined, there is no difficulty in coping with
infeasible points, and it is easy to program the method using an existing quasi-Newton
subroutine. The main disadvantage is that the sequential nature of the method is less
efficient than the more direct approach of Section 12.3. Also the global
convergence result based on increasing $\sigma$, whilst powerful in theory, does not always work
well in practice, and there are practical applications in which it has caused
ill-conditioning and low accuracy.

Two final points are worthy of note. Firstly if the problem is a mixture of linear
and non-linear constraints, then it may be worth incorporating only the non-linear
constraints into the penalty function, and the linear constraints can be included
when $\phi(x, \lambda, \sigma)$ is minimized, for example at step (ii) in algorithm (12.1.4). This is
especially true for bounds on the variables, since minimization with bounds is not
a significant complication on an unconstrained minimization routine. Another
point is that approximate minimization of the penalty function (see (12.1.21) for
example) can also be considered, and a review of recent work in this area is given
by Coope and Fletcher (1979). In this case the algorithms start to become more
like the direct methods of Section 12.3 and Coope and Fletcher give an algorithm
which incorporates the Lagrangian correction defined in (12.3.5).


12.3 The Lagrange—Newton (SOLVER) Method 


A penalty function method is a somewhat indirect way of attempting to solve
non-linear constraint problems. A more direct and efficient approach is to iterate on the
basis of certain approximations to the problem functions $f(x)$ and $c(x)$, in
particular by using linear approximations to the constraint functions $c(x)$. This has to be
done with some care to ensure rapid convergence properties close to the solution,
and one particular method is seen to be fundamental in this respect. This method
is most simply explained as being Newton's method applied to find the stationary
point of the Lagrangian function (9.1.6), and hence might be referred to as the
Lagrange-Newton method. The Lagrangian function is defined in terms of variables
$x$ and $\lambda$, so a feature of the resulting methods is that a sequence of approximations
$x^{(k)}, \lambda^{(k)}$ to both the solution vector $x^*$ and the vector of optimum Lagrange
multipliers $\lambda^*$ is generated.

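In the first instance, consider applying the method to the equality constraint
problem (12.1.1) and define $\nabla$ as in (9.1.8) so that equation (9.1.7) is the
stationary point (KT) condition at $x^*, \lambda^*$. As usual a Taylor series for $\nabla\mathcal{L}$ about $x^{(k)}$,
$\lambda^{(k)}$ gives

$$\nabla\mathcal{L}(x^{(k)} + \delta x, \lambda^{(k)} + \delta\lambda) = \nabla\mathcal{L}^{(k)} + [\nabla^2\mathcal{L}^{(k)}]\begin{pmatrix}\delta x \\ \delta\lambda\end{pmatrix} + \cdots \qquad (12.3.1)$$

where $\nabla\mathcal{L}^{(k)} = \nabla\mathcal{L}(x^{(k)}, \lambda^{(k)})$, etc. Neglecting higher order terms and setting the
left hand side to zero by virtue of (9.1.7) gives the iteration

$$[\nabla^2\mathcal{L}^{(k)}]\begin{pmatrix}\delta x \\ \delta\lambda\end{pmatrix} = -\nabla\mathcal{L}^{(k)}. \qquad (12.3.2)$$

This is solved to give corrections $\delta x$ and $\delta\lambda$ and is of course Newton's method for
the stationary point problem. Formulae for $\nabla\mathcal{L}$ and $\nabla^2\mathcal{L}$ are readily obtained
from (9.1.6), giving the system

$$\begin{bmatrix} W^{(k)} & -A^{(k)} \\ -A^{(k)T} & 0 \end{bmatrix}\begin{pmatrix}\delta x \\ \delta\lambda\end{pmatrix} = \begin{pmatrix} -g^{(k)} + A^{(k)}\lambda^{(k)} \\ c^{(k)} \end{pmatrix}. \qquad (12.3.3)$$

$A^{(k)}$ is the Jacobian matrix of constraint normals evaluated at $x^{(k)}$ and

$$W^{(k)} = \nabla^2 f(x^{(k)}) - \sum_i \lambda_i^{(k)}\nabla^2 c_i(x^{(k)}) \qquad (12.3.4)$$

is the Hessian matrix $\nabla_x^2\mathcal{L}(x^{(k)}, \lambda^{(k)})$. In fact it is more convenient to write
$\lambda^{(k+1)} = \lambda^{(k)} + \delta\lambda$ and $\delta^{(k)} = \delta x$, and to solve the equivalent system

$$\begin{bmatrix} W^{(k)} & -A^{(k)} \\ -A^{(k)T} & 0 \end{bmatrix}\begin{pmatrix}\delta^{(k)} \\ \lambda^{(k+1)}\end{pmatrix} = \begin{pmatrix} -g^{(k)} \\ c^{(k)} \end{pmatrix} \qquad (12.3.5)$$

to determine $\delta^{(k)}$ and $\lambda^{(k+1)}$. Then $x^{(k+1)}$ is given by

$$x^{(k+1)} = x^{(k)} + \delta^{(k)}. \qquad (12.3.6)$$

The method requires initial approximations $x^{(1)}$ and $\lambda^{(1)}$, and uses (12.3.5) and
(12.3.6) to generate the iterative sequence $\{x^{(k)}, \lambda^{(k)}\}$.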
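The iteration (12.3.5) and (12.3.6) is short enough to sketch in full for a two-variable problem with one equality constraint. The example problem (minimize $-x_1 - x_2$ subject to $1 - x_1^2 - x_2^2 = 0$, whose solution is $x^* = (1/\sqrt2, 1/\sqrt2)$, $\lambda^* = 1/\sqrt2$), the starting values, and the helper names are assumptions chosen for illustration.

```python
def solve(M, b):
    """Solve M z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    A = [row[:] + [bi] for row, bi in zip(M, b)]
    for k in range(n):
        p = max(range(k, n), key=lambda i: abs(A[i][k]))
        A[k], A[p] = A[p], A[k]
        for i in range(k + 1, n):
            m = A[i][k] / A[k][k]
            for j in range(k, n + 1):
                A[i][j] -= m * A[k][j]
    z = [0.0] * n
    for i in reversed(range(n)):
        s = A[i][n] - sum(A[i][j] * z[j] for j in range(i + 1, n))
        z[i] = s / A[i][i]
    return z


def lagrange_newton_step(x, lam):
    """One step of (12.3.5)-(12.3.6) for f = -x1 - x2, c = 1 - x1^2 - x2^2."""
    x1, x2 = x
    g = [-1.0, -1.0]                    # gradient of f
    a = [-2.0 * x1, -2.0 * x2]          # constraint normal (the single column of A)
    c = 1.0 - x1 * x1 - x2 * x2
    w = 2.0 * lam                       # W = -lam * hess(c) = 2*lam*I since f is linear
    K = [[w, 0.0, -a[0]],
         [0.0, w, -a[1]],
         [-a[0], -a[1], 0.0]]
    d1, d2, lam_new = solve(K, [-g[0], -g[1], c])
    return (x1 + d1, x2 + d2), lam_new


x, lam = (0.8, 0.6), 0.5
for k in range(1, 9):
    print(k, x, lam)
    x, lam = lagrange_newton_step(x, lam)
```

Note that the new multiplier is obtained directly from the linear system, so $\lambda^{(k)}$ enters the step only through $W^{(k)}$.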
An alternative way of deriving this method is to observe that the second order
sufficient conditions at $x^*, \lambda^*$ imply that $x^*$ (that is $\delta = 0$) solves the problem

$$\text{minimize}_\delta \; \mathcal{L}(x^* + \delta, \lambda^*) \quad \text{subject to } A^{*T}\delta = 0 \qquad (12.3.7)$$

because (9.3.5) and (9.3.4) ensure strictly positive curvature in the feasible region.
Adding $0 = \lambda^{*T} A^{*T}\delta = g^{*T}\delta$ into the objective yields an equivalent problem

$$\text{minimize}_\delta \; \tfrac12\delta^T W^*\delta + g^{*T}\delta + f^* \quad \text{subject to } A^{*T}\delta = 0 \qquad (12.3.8)$$

which is also solved by $\delta = 0$, but which has $\lambda^*$ as the Lagrange multiplier vector of
the constraints. The constraints in (12.3.8) are linear approximations at $x^*$ to
$c(x) = 0$, so by analogy, if $x^{(k)}, \lambda^{(k)}$ are approximations to $x^*, \lambda^*$, then solution of
the subproblem

$$\text{minimize}_\delta \; q^{(k)}(\delta) \quad \text{subject to } l^{(k)}(\delta) = 0 \qquad (12.3.9)$$

is suggested, where

$$q^{(k)}(\delta) \triangleq \tfrac12\delta^T W^{(k)}\delta + g^{(k)T}\delta + f^{(k)} \qquad (12.3.10)$$

and

$$l^{(k)}(\delta) \triangleq A^{(k)T}\delta + c^{(k)}. \qquad (12.3.11)$$

The second order sufficient conditions at $x^*, \lambda^*$ ensure that (12.3.9) has a
well-determined solution when $x^{(k)}, \lambda^{(k)}$ are close to $x^*, \lambda^*$ (see Question 12.7). Thus
the following iterative method is suggested, given initial estimates $x^{(1)}$ and $\lambda^{(1)}$.

  For $k = 1, 2, \ldots$
  (i)  Solve (12.3.9) (or (12.3.13)) to determine $\delta^{(k)}$ and let
       $\lambda^{(k+1)}$ be the Lagrange multipliers of the linear constraints.   (12.3.12)
  (ii) Set $x^{(k+1)} = x^{(k)} + \delta^{(k)}$.

This method clearly indicates that for the inequality non-linear constraint problem
(12.1.2) a suitable generalization is to solve the subproblem

$$\text{minimize}_\delta \; q^{(k)}(\delta) \quad \text{subject to } l^{(k)}(\delta) \ge 0. \qquad (12.3.13)$$

This is the basis of the SOLVER method (Wilson, 1963) as interpreted by Beale
(1967) and it is convenient to refer to it by this name. Both (12.3.9) and (12.3.13)
are quadratic programming subproblems and are readily solved by the methods of
Chapter 10.

In fact equations (12.3.5) are the KT conditions implied by the solution of
(12.3.9) so the two are clearly equivalent when a unique solution to (12.3.9) exists.
However (12.3.9) is more fundamental as it requires the correct curvature
conditions on $W^{(k)}$ for a solution to exist, which (12.3.5) does not. The situation is
analogous to that in which Newton's method (see Section 3.1) is best interpreted
as minimizing certain quadratic approximations $q^{(k)}(\delta)$. The subproblem (12.3.9)
obtained on the $k$th iteration can be interpreted as being derived from (12.1.1) by
replacing the constraints $c(x) = 0$ by their linear Taylor series approximation
$l^{(k)}(\delta) = 0$ at $x^{(k)}$ given by (12.3.11) and by replacing the objective function
$f(x)$ by the quadratic approximation $q^{(k)}(\delta)$ in (12.3.10). This is a quadratic
Taylor series approximation at $x^{(k)}$ but with the addition of constraint curvature
terms in the Hessian. Including the second order constraint terms in the quadratic
programming subproblem is important in that otherwise a second order rate of
convergence for non-linear constraints would not be obtained. This is well
illustrated by problem (12.1.5) in which the objective function is linear so that it is
the curvature of the constraint which causes a solution to exist. In this case the
sequence of quadratic programming subproblems only becomes well defined if the
constraint curvature terms are included.

A numerical example which illustrates many of the features of the method is
given by the inequality constraint problem (9.1.15) from $x^{(1)} = (\tfrac12, 1)^T$ and
$\lambda^{(1)} = 0$ (see Table 12.3.1). Since $\lambda^{(1)} = 0$ no constraint curvature terms occur
in $W^{(1)}$, which is therefore the zero matrix since $f(x)$ is linear. Thus the initial
subproblem is a linear programming problem and $x^{(2)}$ is the vertex of the constraints
linearized about $x^{(1)}$. In fact, even though the constraint $c_1(x) \ge 0$ is not active at
the solution, the presence of its linearization is necessary to permit the first
subproblem to be solvable. Moreover there exist different $x^{(1)}$ for which the initial LP
is unbounded, all of which illustrates that the well-behaved nature of (12.3.9) only
holds necessarily in a neighbourhood of $x^*, \lambda^*$. In this case however the solution
to the LP is well defined, and $\lambda^{(2)} = (\tfrac13, \tfrac23)^T$ is the multiplier vector at its solution,
indicating that both linearized constraints are active. Thus for the second iteration
the Hessian is $W^{(2)} = \mathrm{diag}(2, \tfrac43)$, which is nicely positive definite. Solving the
resulting QP problem causes $l_1^{(2)}(\delta) \ge 0$ to become inactive so that $\lambda_1^{(3)} = 0$. Thus the
correct active set is established, and the rapid convergence associated with Newton's
method is observed on subsequent iterations. In comparing this with Table 12.2.1
say, it should be remembered that in the latter, each iteration requires an
unconstrained minimization calculation and therefore many evaluations of the problem
functions and their derivatives. In contrast, the SOLVER method only requires one
evaluation of the problem functions and derivatives to determine the coefficients of
the finite quadratic programming subproblem. Thus the SOLVER method, if it works,
is generally considerably superior in terms of the number of function and derivative
evaluations which are required.

Table 12.3.1 The SOLVER method applied to (9.1.15)

  k   x_1^(k)    x_2^(k)    lambda_1^(k)  lambda_2^(k)  c_1^(k)     c_2^(k)
  1   1/2        1          0             0             3/4         -1/4
  2   11/12      2/3        1/3           2/3           -0.173611   -0.284722
  3   0.747120   0.686252   0             0.730415      0.128064    -0.029130
  4   0.708762   0.706789   0             0.706737      0.204445    -0.001893
  5   0.707107   0.707108   0             0.707105      0.207108    -0.28 x 10^-5
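The first iteration of the example can be reproduced in exact arithmetic. The sketch assumes that problem (9.1.15) has $f = -x_1 - x_2$, $c_1 = x_2 - x_1^2 \ge 0$ and $c_2 = 1 - x_1^2 - x_2^2 \ge 0$; this form is inferred from the tabulated values rather than quoted from the definition of (9.1.15). It also uses the fact, stated in the text, that both linearized constraints are active at the LP solution, so that the vertex is found from a 2 x 2 linear system.

```python
from fractions import Fraction as F


def solve2(m11, m12, m21, m22, r1, r2):
    """Solve a 2 x 2 linear system by Cramer's rule."""
    det = m11 * m22 - m12 * m21
    return (r1 * m22 - m12 * r2) / det, (m11 * r2 - m21 * r1) / det


def constraints(x):
    return x[1] - x[0] ** 2, 1 - x[0] ** 2 - x[1] ** 2


def normals(x):
    return (-2 * x[0], F(1)), (-2 * x[0], -2 * x[1])


x1 = (F(1, 2), F(1))
g = (F(-1), F(-1))                      # gradient of the assumed linear objective
c1, c2 = constraints(x1)
a1, a2 = normals(x1)

# vertex of the linearized constraints: a_i^T delta + c_i = 0, i = 1, 2
delta = solve2(a1[0], a1[1], a2[0], a2[1], -c1, -c2)
x2 = (x1[0] + delta[0], x1[1] + delta[1])

# multipliers of the LP: g = lam_1 a_1 + lam_2 a_2
lam = solve2(a1[0], a2[0], a1[1], a2[1], g[0], g[1])

# W = -lam_1 hess(c_1) - lam_2 hess(c_2), hess(c_1) = diag(-2, 0), hess(c_2) = diag(-2, -2)
W2 = (2 * lam[0] + 2 * lam[1], 2 * lam[1])

print("x(2)      =", x2)
print("lambda(2) =", lam)
print("c(x(2))   =", [float(v) for v in constraints(x2)])
print("diag W(2) =", W2)
```

Both multipliers are positive, which confirms that the vertex is the solution of the initial LP.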

An important feature of the method which Table 12.3.1 illustrates is that
ultimately the rate of convergence is second order. If second order sufficient conditions
for the equality constraint problem (12.1.1) hold at $x^*, \lambda^*$, and if rank $A^* = m$,
then the Lagrangian matrix

$$\begin{bmatrix} W^* & -A^* \\ -A^{*T} & 0 \end{bmatrix} \qquad (12.3.14)$$

is non-singular (see Question 12.4). The second order convergence of iteration
(12.3.5) and (12.3.6) then follows by virtue of theorem 6.2.1 applied to the system
of $n + m$ equations $\nabla\mathcal{L}(x, \lambda) = 0$. This requires both $x^{(k)}$ and $\lambda^{(k)}$ to be
sufficiently close to $x^*$ and $\lambda^*$ for some $k$. In fact the multiplier estimates $\lambda^{(k)}$ play a
relatively minor role, in that they only arise in the second order term involving
$W^{(k)}$ and this can be exploited to give a stronger result.


Theorem 12.3.1 


If $x^{(1)}$ is sufficiently close to $x^*$, if the Lagrangian matrix
$\begin{bmatrix} W^{(1)} & -A^{(1)} \\ -A^{(1)T} & 0 \end{bmatrix}$ is
non-singular, and if second order sufficient conditions hold at $x^*, \lambda^*$ with
rank $A^* = m$, then the Lagrange-Newton iteration (12.3.5) and (12.3.6) converges
and the rate is second order.


Proof 


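Define errors $h^{(k)} = x^{(k)} - x^*$ and $\Delta^{(k)} = \lambda^{(k)} - \lambda^*$, and assume that $f$ and the $c_i$
are $C^2$ and the elements of their Hessian matrices satisfy Lipschitz conditions, so
that the Taylor series about $x^{(k)}$

$$c^* = c^{(k)} - A^{(k)T} h^{(k)} + O(\|h^{(k)}\|^2)$$
$$g^* = g^{(k)} - \nabla^2 f^{(k)} h^{(k)} + O(\|h^{(k)}\|^2)$$
$$a_i^* = a_i^{(k)} - \nabla^2 c_i^{(k)} h^{(k)} + O(\|h^{(k)}\|^2), \qquad i = 1, 2, \ldots, m$$

are valid. It follows from (12.3.5) that $h^{(k+1)}, \Delta^{(k+1)}$ satisfy the equations

$$\begin{bmatrix} W^{(k)} & -A^{(k)} \\ -A^{(k)T} & 0 \end{bmatrix}\begin{pmatrix} h^{(k+1)} \\ \Delta^{(k+1)} \end{pmatrix} = \begin{pmatrix} -\sum_i \Delta_i^{(k)}\nabla^2 c_i^{(k)} h^{(k)} + O(\|h^{(k)}\|^2) \\ O(\|h^{(k)}\|^2) \end{pmatrix} = \begin{pmatrix} O(\|h^{(k)}\|^2) + O(\|h^{(k)}\|\,\|\Delta^{(k)}\|) \\ O(\|h^{(k)}\|^2) \end{pmatrix}. \qquad (12.3.15)$$

At $x^*, \lambda^*$ the Lagrangian matrix is non-singular (see Question 12.4) so for $x^{(k)}$,
$\lambda^{(k)}$ in some neighbourhood of $x^*, \lambda^*$,

$$\begin{pmatrix} h^{(k+1)} \\ \Delta^{(k+1)} \end{pmatrix} = O(\|h^{(k)}\|^2) + O(\|h^{(k)}\|\,\|\Delta^{(k)}\|)$$

and so there exists a constant $c > 0$ such that

$$\max(\|h^{(k+1)}\|, \|\Delta^{(k+1)}\|) \le c\,\|h^{(k)}\| \max(\|h^{(k)}\|, \|\Delta^{(k)}\|). \qquad (12.3.16)$$

Thus, in a smaller neighbourhood, if $1 > c\max(\|h^{(k)}\|, \|\Delta^{(k)}\|) = \alpha$, say, then

$$\max(\|h^{(k+1)}\|, \|\Delta^{(k+1)}\|) \le \alpha\|h^{(k)}\| < \max(\|h^{(k)}\|, \|\Delta^{(k)}\|),$$

so the iteration converges and the rate is seen to be quadratic from (12.3.16). Now
let only $x^{(1)}$ be in a neighbourhood of $x^*$, so that $A^{(1)}$ has full rank, and let $\lambda^{(1)}$ be
such that the Lagrangian matrix is non-singular. Then $\|\Delta^{(1)}\| > \|h^{(1)}\|$ and so as
above there exists a constant, $d$ say, such that

$$\max(\|h^{(2)}\|, \|\Delta^{(2)}\|) \le d\,\|h^{(1)}\|\,\|\Delta^{(1)}\|.$$

If $x^{(1)}$ is sufficiently close to $x^*$ in that $\|h^{(1)}\| < 1/(cd\,\|\Delta^{(1)}\|)$ then
$\max(\|h^{(2)}\|, \|\Delta^{(2)}\|) < 1/c$ and so $x^{(2)}, \lambda^{(2)}$ is in the neighbourhood for which
convergence occurs. □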


Essentially therefore the minor role played by $\lambda^{(k)}$ is illustrated by the fact that
$\|\Delta^{(k)}\|$ only occurs linearly on the right hand side of (12.3.15), and it is this fact
that is exploited in the theorem. For example, if $x^{(1)} = x^*$, then $x^{(2)} = x^*$ and
$\lambda^{(2)} = \lambda^*$ irrespective of errors in $\lambda^{(1)}$. This suggests that when using the method,
it is more important to have $x^{(1)}$ accurate than $\lambda^{(1)}$. This is in contrast to the
multiplier penalty function method of Section 12.2 where an inaccurate value of
$\lambda^{(1)}$ definitely limits the extent to which $x(\lambda^{(1)})$ agrees with $x^*$, assuming $\sigma$ is
fixed. The result of theorem 12.3.1 however concerns (12.3.5) and not (12.3.9),
and if large errors in $\lambda^{(1)}$ are made then the curvature of $W^{(k)}$ can be such that no
solution exists to (12.3.9). Thus some care must be taken in interpreting the
theorem. There is no difficulty in extending the theorem to handle inequality
constraints and the details are sketched out in Question 12.18.

These results show that the local properties of the SOLVER method are very
satisfactory, so that the main difficulty which exists is the fact that the iteration
may fail to converge remote from the solution, and that the solution to the
subproblem (12.3.9) or (12.3.13) may not even exist (it may either be unbounded or
infeasible). To induce global convergence requires some measure of goodness of
$x^{(k)}$ and $\lambda^{(k)}$ to be available which is minimized locally at the solution. One way
to do this is to define the error in the KT conditions

$$\phi(x, \lambda) = \|c\|_2^2 + \|g - A\lambda\|_2^2$$

where $c = c(x)$, etc., but this is not entirely satisfactory in that it does not give any
bias to minimizing the objective function and can therefore cause convergence to
any stationary point to occur. A more suitable possibility is to use an exact penalty
function, that is any function $\phi(x)$, defined in terms of $f(x)$, $c(x)$, and possibly
derivatives thereof, which is minimized locally by $x^*$. An important exact penalty
function which can be used is the non-differentiable function $\phi = \nu f + \|c\|_1$ which
is described in more detail in Section 14.3. A smooth exact penalty function is
given by the augmented Lagrangian function $\phi(x, \lambda^*, \sigma)$ in (12.2.3) (see theorem
12.2.1) but this requires a knowledge of $\lambda^*$. Yet another exact penalty function is
described briefly in Section 12.5. One possibility for stabilizing the SOLVER
iteration therefore is to use the resulting correction $\delta$ to define a search direction,
along which some exact penalty function is minimized. This possibility is explored
later in the section. However it is important to realize that this idea alone can still
fail, just as a line search can fail to induce convergence in the Newton-Raphson
method for solving equations, which is a special case of the SOLVER method
(Fletcher, 1980b).

The most successful way of inducing global convergence for unconstrained
minimization is arguably the restricted step or trust region approach or the use of
a Levenberg-Marquardt parameter (see Volume 1, Chapter 5). Both these ideas
have been suggested in the context of SOLVER-like methods (in fact a step
restriction does occur in Beale's (1967) description of the SOLVER method). Modifying
$W^{(k)}$ by adding a Levenberg-Marquardt term $\nu I$ may help to ensure that (12.3.9)
is not unbounded, but no longer has the effect of giving a bias towards steepest
descent. In fact for a non-linear equations problem ($m = n$) the correction $\delta$ is
unchanged, so the modification has no effect and therefore does not provide a
guarantee of convergence. If a step length restriction $\|\delta\| \le h^{(k)}$ is added to
(12.3.9) or (12.3.13) then the possibility of an unbounded correction is entirely
removed. In this case the difficulty is that if $x^{(k)}$ is infeasible and $h^{(k)}$ is sufficiently
small then the resulting subproblem has no feasible point. Thus the idea in Section
5.1 of forcing convergence by reducing $h^{(k)}$ if necessary so as to give a descent step,
is no longer valid. Therefore the straightforward generalization of the SOLVER
method in this way is unsatisfactory. However it is possible to incorporate the
restricted step approach with a method which makes an approximation to the
exact $L_1$ penalty function in such a way as to guarantee global convergence and
yet retain many of the features of the SOLVER method. This approach is described
in Sections 14.3 and 14.5 and is very promising, although the results regarding rate
of convergence are somewhat weakened (see again Section 14.5).

Another practical disadvantage of the SOLVER method is the need to compute
second derivatives. Variations of the method have been suggested in which updating
formulae, analogous to those in quasi-Newton methods, are used to revise a matrix
$B^{(k)}$ which approximates $W^{(k)}$. Han (1976) suggests using the DFP formula (see
Volume 1, Section 3.2) but with $\gamma^{(k)}$ being defined by

$$\gamma^{(k)} = \nabla\mathcal{L}(x^{(k+1)}, \lambda^{(k+1)}) - \nabla\mathcal{L}(x^{(k)}, \lambda^{(k+1)}) \qquad (12.3.17)$$

and shows that the resulting algorithm is superlinearly convergent. Powell (1978a)
prefers to keep the matrix $B^{(k)}$ positive definite so that the solution to the
subproblem is always well defined. He does this by defining the vector

$$\eta^{(k)} = \theta\gamma^{(k)} + (1 - \theta)B^{(k)}\delta^{(k)}, \qquad 0 \le \theta \le 1 \qquad (12.3.18)$$

($\gamma^{(k)}$ as in (12.3.17)) that is closest to $\gamma^{(k)}$ subject to the condition

$$\delta^{(k)T}\eta^{(k)} \ge 0.2\,\delta^{(k)T}B^{(k)}\delta^{(k)}.$$

$\eta^{(k)}$ is used in place of $\gamma^{(k)}$ in the updating formula and Powell prefers the BFGS
formula on account of its success in solving unconstrained minimization problems.
It might be thought that this device is somewhat artificial since the matrix $W^*$ may
not be positive definite, so that $B^{(k)}$ may never be a close approximation to $W^{(k)}$.
However the projections of $B^{(k)}$ and $W^{(k)}$ into the tangent hyperplanes of the
active constraints are likely to be close, and this is the important part of $B^{(k)}$ insofar
as the solution of the subproblem is concerned. Powell (1978b) is able to exploit
this observation to prove superlinear convergence even when $W^*$ is indefinite.
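Powell's device (12.3.18) can be sketched as follows, taking "closest to $\gamma^{(k)}$" to mean the largest $\theta \in [0, 1]$ that satisfies the condition, which gives $\theta = 1$ when $\delta^T\gamma \ge 0.2\,\delta^T B\delta$ and $\theta = 0.8\,\delta^T B\delta/(\delta^T B\delta - \delta^T\gamma)$ otherwise. The BFGS update is written for the Hessian approximation $B$; the function names and the small example are illustrative assumptions.

```python
def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def matvec(B, v):
    return [dot(row, v) for row in B]


def damped_bfgs_update(B, delta, gamma):
    """BFGS update of B with gamma replaced by eta from (12.3.18).

    Returns the updated matrix, the vector eta and the value of theta.
    """
    Bd = matvec(B, delta)
    dBd = dot(delta, Bd)
    dg = dot(delta, gamma)
    # largest theta in [0, 1] with delta^T eta >= 0.2 * delta^T B delta
    theta = 1.0 if dg >= 0.2 * dBd else 0.8 * dBd / (dBd - dg)
    eta = [theta * gi + (1.0 - theta) * bi for gi, bi in zip(gamma, Bd)]
    de = dot(delta, eta)
    n = len(delta)
    B_new = [[B[i][j] - Bd[i] * Bd[j] / dBd + eta[i] * eta[j] / de
              for j in range(n)] for i in range(n)]
    return B_new, eta, theta


B = [[1.0, 0.0], [0.0, 1.0]]
delta = [1.0, 0.0]
gamma = [-1.0, 0.5]          # delta^T gamma < 0: a plain BFGS update would lose definiteness
B_new, eta, theta = damped_bfgs_update(B, delta, gamma)
print("theta =", theta)
print("eta   =", eta)
print("B_new =", B_new)
```

In the example $\delta^T\gamma$ is negative, so $\theta = 0.4$ and the updated matrix remains positive definite.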

Han (1977) also suggests a line search based on using the $L_1$ exact penalty function.
He gives a global convergence result assuming firstly that the matrices $B^{(k)}$ are
positive definite and that $B^{(k)}$ and $B^{(k)-1}$ are bounded, and secondly that the
multipliers $\lambda^{(k)}$ are bounded. He also makes another assumption which excludes an
inconsistent linearization which can occur when the vectors $a_i$, $i \in \mathcal{A}^{(k)}$ are
dependent. Using assumptions like these is unsatisfactory and should not obscure the
need to study more certain ways of inducing global convergence. Powell (1978a)
also uses a line search using an $L_1$ exact penalty function but for computational
reasons changes the penalty parameters $\mu_i$ (see (14.3.17)). A strong global
convergence result for Powell's algorithm also does not hold, and failure has been observed
(Chamberlain, 1979). However good practical results have been observed with these
algorithms on a number of standard test problems, which substantiates to some
extent the claim that they provide a robust modification of the SOLVER method.
Table 12.3.2 compares the number of function and gradient evaluations required
to solve Colville's (1968) first three problems and gives some idea of the relative
improvement that is obtained. Even allowing for the fact that Powell's method
solves a QP problem on each iteration, the improvement is still substantial.
However Powell (1977) reports some difficulty in solving the Dembo 7 test problem
(Dembo, 1976) (as also do Coope and Fletcher (1979) with an augmented
Lagrangian method), whilst it has been possible to solve this problem with an
algorithm of the type described in Section 14.5 (Fletcher, 1980b). Fletcher also
gives two other problems which are solved by the latter but not by a SOLVER-like
method and this is an indication of the more robust nature of the type of
algorithm given in Section 14.5.

Table 12.3.2 Comparison of nonlinear programming techniques

  Problem   Extrapolated        Multiplier          Powell (1978a)
            barrier function    penalty function
  TP1       [illegible]         47                  6
  TP2       245                 [illegible]         7
  TP3       123                 [illegible]         3


12.4 Non-linear Elimination and Feasible Direction Methods


An apparently attractive approach to the nonlinear programming problem is to 
try to produce direct methods which generalize the ideas in Chapters 10 and 11. 
These methods attempt to maintain feasibility by searching from one feasible 
point to another along feasible arcs (lines in Chapters 10 and 11) and hence are 
referred to as feasible direction methods. They date back at least as far as 
Zoutendijk (1960) and Rosen (1960, 1961). Inequality constraints are handled 
by an active constraint strategy as in Sections 10.3 and 11.2 but attention is 
initially directed towards methods suitable for solving the equality problem 
(12.1.1). The presence of non-linear constraints does not allow a line search to 
maintain feasibility so a simplistic explanation of what is done is that at any feas- 
ible point x) a search direction s‘*) in the tangent plane is calculated and a 
feasible arc is then obtained by projecting any point on the resulting line into a 
corresponding point in the feasible region (see Figure 12.4.1). A search along the 
resulting feasible arc is then made to reduce f(x). Many methods of this type have 
been suggested. Another possible approach to nonlinear programming is to use 
variables to eliminate constraints, if necessary by solving a non-linear system of 
equations at each iteration. It is shown below that this is a special case of a non- 
linear generalized elimination method. This turns out to be equivalent to the idea 
of a feasible direction method and a number of common methods are shown to be 
special cases of elimination. The projection step in the feasible direction method is 
then equivalent to the solution of the non-linear system of equations in the 
elimination method. 


Figure 12.4.1 Feasible direction search (diagram labels: line in tangent plane; feasible arc; c(x) = 0)

It is shown in Section 10.1 that many methods for linear constraints are generalized
elimination methods and correspond to different ways of choosing V in
(10.1.20). Another way of regarding this construction is that a linear transformation
to new variables y is made, defined by


$$\begin{pmatrix} y_1 \\ y_2 \end{pmatrix} = [A : V]^T x - \begin{pmatrix} b \\ 0 \end{pmatrix}$$


where the partitions are $y_1 \in \mathbb{R}^m$ and $y_2 \in \mathbb{R}^{n-m}$. Since [A : V] is non-singular
the transformation is one to one. Then the method is derived by keeping $y_1 = 0$,
thus satisfying the constraints, and minimizing with respect to the remaining variables
$y_2$ (corresponding to y in (10.1.10)). A similar approach is possible when
solving non-linear constraint problems. It is assumed that $x^{(k)}$ is a feasible point
at which $A^{(k)}$ has full rank. It is then appropriate to consider an analogous non-linear
transformation $x \leftrightarrow y$ defined by


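$$\begin{pmatrix} y_1 \\ y_2 \end{pmatrix} = \begin{pmatrix} c(x^{(k)} + \delta) \\ V^T \delta \end{pmatrix} \qquad (12.4.1)$$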


V is such that [A : V] is non-singular in which case the transformation is one-to-one
in some neighbourhood of $x^{(k)}$: in fact as (12.4.1) indicates, new y variables
are defined on each iteration to help keep the transformation well defined. If $y_2$ is
any value of the free variables, the corresponding $\delta$ is required which solves
(12.4.1) with $y_1 = 0$. This can be calculated by the Newton–Raphson iteration
(Section 6.2) in an inner (r) iteration. The Jacobian matrix of the transformation
(12.4.1) is [A : V] and a suitable initial approximation is


$$\delta^{(1)} = Z y_2. \qquad (12.4.2)$$


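The sequence of iterates is defined for $r \ge 1$ by

$$\delta^{(r+1)} = \delta^{(r)} - [A : V]^{-T} \begin{pmatrix} c^{(r)} \\ 0 \end{pmatrix} = \delta^{(r)} - S c^{(r)} \qquad (12.4.3)$$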


where A (and hence S) and $c^{(r)}$ are evaluated at $x^{(k)} + \delta^{(r)}$. When the r-iteration is
deemed to have converged, $\delta^{(r+1)}$ becomes the required vector $\delta$. Both $\delta$ and x
($= x^{(k)} + \delta$) can be regarded as functions of the free variables $y_2$. In practice it is
more efficient in (12.4.3) to evaluate A and hence S at $x^{(k)}$, although the r-iteration
no longer has a second order rate of convergence. Convergence of the r-iteration
can be proved if (12.4.2) is a sufficiently good initial approximation, which is
true for small enough $y_2$. In terms of Figure 12.4.1, $\delta^{(1)}$ in (12.4.2) can be
regarded as a step in the tangent plane and the iteration (12.4.3) is the projection
of $x^{(k)} + \delta^{(1)}$ so that $x^{(k)} + \delta$ satisfies the non-linear constraints. The direction of
the projection is seen from (12.4.3) to lie in the column space of S and will therefore
differ depending on how S is defined in (10.1.20).
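
The tangent step (12.4.2) followed by the projection iteration (12.4.3) can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions: the single constraint is taken to be the unit circle $c(x) = 1 - x_1^2 - x_2^2 = 0$, the orthogonal choice of S and Z is used, S is held fixed at $x^{(k)}$, and the function names (`project`, `orthogonal_s_z`) are hypothetical.

```python
import math


def c(x):
    """Constraint c(x) = 1 - x1^2 - x2^2 (unit circle; an assumed example)."""
    return 1.0 - x[0] ** 2 - x[1] ** 2


def grad_c(x):
    """The single column of A when n = 2 and m = 1."""
    return [-2.0 * x[0], -2.0 * x[1]]


def orthogonal_s_z(a):
    """Orthogonal choice: S = a/(a^T a) so S^T a = 1; Z is a unit vector with Z^T a = 0."""
    aa = a[0] ** 2 + a[1] ** 2
    norm = math.sqrt(aa)
    return [a[0] / aa, a[1] / aa], [-a[1] / norm, a[0] / norm]


def project(xk, y2, tol=1e-12, max_iter=50):
    """Return delta with c(xk + delta) close to 0, plus the inner iterates (x, c)."""
    s, z = orthogonal_s_z(grad_c(xk))      # S and Z held fixed at x^(k)
    delta = [z[0] * y2, z[1] * y2]         # (12.4.2): step in the tangent plane
    history = []
    for _ in range(max_iter):
        x = [xk[0] + delta[0], xk[1] + delta[1]]
        cr = c(x)
        history.append((x, cr))
        if abs(cr) <= tol:
            break
        # (12.4.3): correct delta along the column space of S
        delta = [delta[0] - s[0] * cr, delta[1] - s[1] * cr]
    return delta, history


delta, history = project([0.8, 0.6], -1.0 / 7.0)
for r, (x, cr) in enumerate(history, start=1):
    print(r, round(x[0], 6), round(x[1], 6), cr)
```

Because S is frozen at $x^{(k)}$, the constraint residual shrinks at a fast linear rate rather than quadratically, which is the trade-off described above.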

The idea can be modified to give a search method by defining $p^{(k)} \in \mathbb{R}^{n-m}$ as a
search direction in the space of the free variables and taking $y_2 = \alpha p^{(k)}$ as a step in
this direction. Then $\delta$ ($= \delta(\alpha)$) is calculated from (12.4.2) and (12.4.3) and defines
a feasible arc $x(\alpha) = x^{(k)} + \delta(\alpha)$. The direction of the arc at $x^{(k)}$ is the feasible
direction $s^{(k)} = Z p^{(k)}$ (see Figure 12.4.1). It is then possible to choose $\alpha^{(k)}$ to
optimize $f(x^{(k)} + \delta)$ by a line search along this feasible arc, assuming the arc exists
and can be calculated for sufficiently large $\alpha$, which may not always be the case.

One method of this type is the GRG method of Abadie and Carpentier (1969)
which is equivalent to the direct elimination of variables. Thus x is partitioned into
$x_1 \in \mathbb{R}^m$ and $x_2 \in \mathbb{R}^{n-m}$ and the variables $x_2$ become the free variables $y_2$. The
matrix S is $[A_1^{-T} : 0]^T$ (see (10.1.22)) and Z is $[-A_2 A_1^{-1} : I]^T$. How $p^{(k)}$ is chosen
is described below; otherwise the method follows the above scheme. One way of
writing the projection iteration (12.4.2) and (12.4.3) is to have $\{x^{(k+1,r)}\}$ as a
sequence of estimates of $x^{(k+1)}$ defined by

$$x_1^{(k+1,r+1)} = x_1^{(k+1,r)} - A_1^{-T} c^{(k+1,r)} \qquad (12.4.4)$$

whilst $x_2^{(k+1,r)} = x_2^{(k+1,1)}$ stays fixed, which shows essentially that the $x_1$ variables
are being chosen so as to eliminate the constraints. Another feasible direction
method is the gradient projection method of Rosen (1961) in which the feasible
direction $s^{(k)}$ is the projection of the negative gradient vector into the tangent
plane. Rosen achieves this by representing the projection matrix directly, but a
more stable approach is to use the orthogonal projection method of Section 10.1
to define S and Z, and then to proceed as described in this section. A feasible direction
method which generalizes the Wolfe (1963a) reduced gradient method is also
possible in a similar way.

The next step is to consider how the direction $s^{(k)}$ ($= Z p^{(k)}$) of the feasible arc
might be calculated. The objective function can be regarded as defining a function
$f_y(y) = f(x)$ where x and y are related by (12.4.1). As in Volume 1, Section 3.3,
the chain rule shows that first derivatives are related by

$$g_x = [A : V]\, g_y \qquad (12.4.5)$$

so that using (10.1.20), $\nabla_{y_2} f_y = Z^T g_x$ can again be interpreted as a reduced gradient
vector. Thus early methods (Abadie and Carpentier, 1969; Rosen, 1961) choose
$y_2 = \alpha p^{(k)}$ where $p^{(k)} = -Z^T g^{(k)}$ is the reduced steepest descent direction, and the
direction of the feasible arc is $s^{(k)} = -Z Z^T g^{(k)}$. It is possible to improve these
methods with little complication by using conjugate gradients (see Section 4.1) and
the resulting methods have been used effectively on some problems. Nonetheless
ideas of curvature do not figure strongly in these methods and better algorithms can
be expected. The analogue of Newton's method (Section 3.1) requires second
derivatives of f with respect to x or y. Differentiating in (12.4.5) yields

$$G_x = \sum_{i=1}^{m} \nabla^2 c_i \, \partial f_y / \partial y_i + [A : V]\, G_y\, [A : V]^T \qquad (12.4.6)$$

so that the reduced Hessian at $x^{(k)}$ (that is $y_2 = 0$) is defined by

$$\nabla^2_{y_2} f_y(0) = Z^T W^{(k)} Z \qquad (12.4.7)$$

where $W^{(k)} = G^{(k)} - \sum_i \lambda_i^{(k)} \nabla^2 c_i^{(k)}$ with $\lambda^{(k)} = S^T g^{(k)}$. Thus the Hessian of the
Lagrangian function duly occurs along with what can be regarded as a first order
estimate of Lagrange multipliers (see (11.1.18)). The basic Newton's method therefore
defines a correction $y_2^{(k)}$ by solving the system

$$[Z^T W^{(k)} Z]\, y_2 = -Z^T g^{(k)} \qquad (12.4.8)$$

which is then used with (12.4.2) and (12.4.3). Use of this formula to define a search
direction, or its use in a modified Newton method (see Volume 1, Section 3.1) are
all possible extensions of this technique. It should be noted that the matrices Z and
S which arise above should strictly be written as $Z^{(k)}$ and $S^{(k)}$ since they are determined
by $A^{(k)}$ which is no longer constant as it is in Chapter 10.

An illustration of this Newton method to solve problem (12.1.5) from the feasible
point $x^{(1)} = (0.8, 0.6)^T$ is given in Table 12.4.1. Orthogonal factorizations
$Z^T A = 0$, $Z^T Z = I$, and $S = A^{+T}$ are used in the method. The tabulated values are
related to (12.4.3) by $x^{(k+1,r)} = x^{(k)} + \delta^{(r)}$. The convergence of the inner iteration
(12.4.3) at a very rapid linear rate can be seen, although it is not quadratic since S
is evaluated at $x^{(k)}$ and not $x^{(k+1,r)}$. The rapid convergence of the outer iteration
which uses (12.4.8) can also be observed. An extension to quasi-Newton methods
is also possible, in which the matrix $Z^{(k)T} W^{(k)} Z^{(k)}$ is updated using the BFGS
formula say, involving differences in reduced gradient vectors

$$\gamma^{(k)} = Z^{(k+1)T} g^{(k+1)} - Z^{(k)T} g^{(k)} \qquad (12.4.9)$$

and differences in reduced coordinates $\delta^{(k)} = y_2^{(k)} - 0 = y_2^{(k)}$.
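
The outer Newton iteration (12.4.8) combined with the inner projection can be sketched as follows. This is a hedged illustration, not the author's code: it assumes that problem (12.1.5) is to minimize $-x_1 - x_2$ subject to $1 - x_1^2 - x_2^2 = 0$ (an assumption that is consistent with the figures quoted in Table 12.4.1), and the function name `newton_elimination` is hypothetical.

```python
import math


def newton_elimination(x, outer=3, tol=1e-12):
    """Reduced Newton method with orthogonal S and Z for the assumed problem
    min -x1 - x2 subject to c(x) = 1 - x1^2 - x2^2 = 0."""
    rows = []
    g = [-1.0, -1.0]                       # gradient of f (constant); Hessian of f is zero
    for k in range(1, outer + 1):
        a = [-2.0 * x[0], -2.0 * x[1]]     # column of A at x^(k)
        aa = a[0] ** 2 + a[1] ** 2
        norm = math.sqrt(aa)
        s = [a[0] / aa, a[1] / aa]         # S^T a = 1
        z = [-a[1] / norm, a[0] / norm]    # Z^T a = 0, Z^T Z = 1
        lam = s[0] * g[0] + s[1] * g[1]    # lambda = S^T g
        # W = G - lam * Hess(c) = 0 - lam * (-2 I) = 2 lam I, so Z^T W Z = 2 lam
        reduced_hessian = 2.0 * lam
        reduced_gradient = z[0] * g[0] + z[1] * g[1]
        y2 = -reduced_gradient / reduced_hessian        # (12.4.8)
        rows.append((k, list(x), lam, y2))
        delta = [z[0] * y2, z[1] * y2]                  # (12.4.2)
        for _ in range(50):                             # inner r-iteration (12.4.3)
            trial = [x[0] + delta[0], x[1] + delta[1]]
            cr = 1.0 - trial[0] ** 2 - trial[1] ** 2
            if abs(cr) <= tol:
                break
            delta = [delta[0] - s[0] * cr, delta[1] - s[1] * cr]
        x = [x[0] + delta[0], x[1] + delta[1]]
    return x, rows


x_final, rows = newton_elimination([0.8, 0.6])
for k, xk, lam, y2 in rows:
    print(k, [round(v, 6) for v in xk], round(lam, 6), round(y2, 6))
print(x_final)
```

The inner loop here is run to a tighter tolerance than the four or so inner steps shown in the table, so the later digits may differ slightly from the tabulated values.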

Another second order feasible direction method is summarized by Sargent
(1974). This is based on using a non-linear version of (12.3.5) in which the constraint
linearization $c^{(k)} + A^{(k)T} \delta = 0$ is replaced by $c(x^{(k)} + \delta) = 0$ as in (12.4.1).
To allow the possibility of changing the length of the correction, a parameter $\alpha$ is
introduced, giving the non-linear transformation

$$\begin{array}{l} W^{(k)} \delta = \alpha (A^{(k)} \lambda - g^{(k)}) \\ c(x^{(k)} + \delta) = 0 \end{array} \qquad (12.4.10)$$

Table 12.4.1 Newton's method for nonlinear programming

k   r   x_1^(k,r)   x_2^(k,r)   λ^(k)      y_2^(k)      c^(k,r)
1       0.8         0.6         0.7        −0.142857
2   1   0.714286    0.714286                            −0.020408
    2   0.706122    0.708163                            −0.000104
    3   0.706081    0.708132                            −0.000001
    4   0.706080    0.708132    0.707106   0.001451     −1.49 × 10^−8
3   1   0.707108    0.707108                            −0.000002
    2   0.707107    0.707107                            −4.19 × 10^−13
    3   0.707107    0.707107                            0

which is solved to give $\delta^{(k)}$ and $\lambda^{(k+1)}$. This system can also be solved in an inner
(r) iteration by the Newton–Raphson method in the form

$$\begin{array}{l} W^{(k)} \delta^{(r+1)} = \alpha (A \lambda^{(k+1,r+1)} - g^{(k)}) \\ c(x^{(k)} + \delta^{(r)}) + A^T (\delta^{(r+1)} - \delta^{(r)}) = 0. \end{array} \qquad (12.4.11)$$

Setting $A = A^{(k)}$ at the expense of the second order rate of convergence enables the
linear system (12.4.11) to be solved for $\delta^{(r+1)}$ and $\lambda^{(k+1,r+1)}$ by any of the
usual techniques described in Sections 10.1 and 10.2. Initial values $\delta^{(1)}$ and
$\lambda^{(k+1,1)}$ for the r-iteration are obtained by solving the linearization of (12.4.10)
in a similar way. When the r-iteration is deemed to have converged, $x^{(k+1)}$ is set to
$x^{(k)} + \delta^{(r+1)}$ and $\lambda^{(k+1)}$ to $\lambda^{(k+1,r+1)}$. To make the algorithm more robust, a
line search is included, along the arc defined by solving (12.4.10) for different
values of $\alpha$, in order to reduce $f(x^{(k)} + \delta)$ sufficiently. To ensure convergence,
some sort of bias towards the (reduced) steepest descent direction must be included.
This can be done for example by adding a Levenberg–Marquardt term $\nu I$ to $W^{(k)}$.
Alternatively, a quasi-Newton approach is possible, and use of Powell's modification
of the BFGS formula (see (12.3.18)) would be a good way of ensuring that
$W^{(k)}$ stays positive definite.

Convergence proofs for feasible direction methods have been given, but there 
are some hidden pitfalls, as Sargent (1974) points out. He considers the possi- 
bility of adding a penalty term, but then the distinction between direct methods 
and penalty function methods using low accuracy minimization becomes indistinct. 
As I see it, the main difficulty of feasible direction methods is the requirement to 
converge the inner r-iteration, because the basic Newton—Raphson method is not 
guaranteed to converge, and more elaborate alternatives would unduly complicate 
the overall method. Although the basic Newton–Raphson method may converge
for small enough a, this may place an undue restriction on the length of step. Also 
there is a correlation between the convergence of the Newton—Raphson method 
and the rank of the Jacobian matrix, and it is not attractive to have to make 
implicit assumptions about the latter. 

Additional complications arise when attention is switched from equality problems
to consider the inequality constraint problem (12.1.2). The most obvious
approach is to use an active set strategy as in Section 11.2. The general idea of
keeping $c_i(x) = 0$, $i \in \mathcal{A}$, and of systematically adding and dropping constraints
remains the same, and estimates of Lagrange multipliers and precautions against
zigzagging need also to be made in a similar way. There is however an additional
difficulty which arises when the correction $Z y_2$ in (12.4.2) causes $x^{(k+1,1)}$ to
violate constraints not in the active set. Ideally the analogue of the linear constraint
algorithm (11.2.2), step (e), would be to iterate the search along the feasible arc
with respect to changes in $\alpha$ in an attempt to find the largest value of $\alpha = \alpha^{(k)}$
which gives a feasible point. The constraint which is then limiting an increase in
$\alpha$ can then be added into the active set, if appropriate. This process was originally
used by Rosen (1961) but it adds yet another level of iteration into the method,
which can therefore become very inefficient. Furthermore it is often possible to
judge directly which constraint will become active, and this constraint can be
zeroed with the other active constraints by extending the dimension of the r-iteration
(12.4.3). However there are then difficulties in writing down a strategy which
covers all possibilities. For example $x^{(k+1,1)}$ may violate so many constraints that
it is not clear which ones to include in the r-iteration. Also constraints violated by
$x^{(k+1,1)}$ might not be active at all on the feasible arc. It is possible to introduce
additional ad hoc rules to cover these cases but the resulting algorithm becomes
cumbersome and unappealing.

In some ways it might be better to disregard the active set strategy and think in 
terms of a more complicated non-linear transformation to replace (12.4.1) or 
(12.4.10). Sargent (1974) suggests 


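$$\begin{array}{l} W^{(k)} \delta = \alpha (A^{(k)} \lambda - g^{(k)}) \\ c(x^{(k)} + \delta) \ge 0, \quad \lambda \ge 0 \\ \lambda^T c(x^{(k)} + \delta) = 0 \end{array}$$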


as the appropriate analogue of (12.4.10) for an inequality constraint problem. This 
can be solved by an inner iteration involving a sequence of quadratic programming 
subproblems. This is a very elegant solution to the difficulties caused by inequality 
constraints, although it does not help in solving the fundamental difficulty which 
is to force convergence of the inner iteration. 


12.5 Other Methods 


Many other methods have been suggested for nonlinear programming, and a few of 
these are reviewed in this section. Although these methods are perhaps currently 
not as popular as those described previously, they often exhibit considerable 
ingenuity and contain interesting ideas which are worthy of note. An interesting 
penalty function for the equality problem (12.1.1) arises from the observation that 
the function 


$$\phi(x, \bar f) = \left\| \begin{pmatrix} f(x) - \bar f \\ c_1(x) \\ c_2(x) \\ \vdots \\ c_m(x) \end{pmatrix} \right\|_p \qquad (12.5.1)$$

is minimized by $x^*$ if the control parameter $\bar f = f^*$. Thus a sequential penalty function
method can be envisaged in which a sequence of estimates $\bar f^{(k)} \to f^*$ is
generated so that the minimizers $x(\bar f^{(k)}) \to x^*$. Morrison (1968) suggests a method
with p = 2 in which the sum of squares function $[\phi(x, \bar f)]^2$ is minimized sequentially,
but there is a potential difficulty that if m < n then $\nabla^2 \phi^2(x^*, \bar f)$ tends to a
singular matrix as $\bar f \to f^*$. This can slow down the rate of convergence of conventional
algorithms. Gill and Murray (1976a) introduce two new features, one of
which is a special purpose method for solving ill-posed least squares problems. They
also give a new formula for updating the control parameter $\bar f^{(k)}$ which has a second
order rate of convergence and controls cancellation errors.
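
The singularity mentioned above for m < n can be seen from a short calculation. The following is a sketch only, assuming p = 2, that the control parameter (written $\bar f$ here) enters (12.5.1) through the component $f(x) - \bar f$, and the usual notation $g = \nabla f$, $G = \nabla^2 f$, $A = [\nabla c_1, \ldots, \nabla c_m]$:

```latex
\begin{align*}
[\phi(x,\bar f)]^2 &= (f(x)-\bar f)^2 + c(x)^T c(x), \\
\tfrac12 \nabla^2 [\phi(x,\bar f)]^2
  &= g g^T + (f(x)-\bar f)\, G + A A^T + \sum_{i=1}^{m} c_i(x)\, \nabla^2 c_i(x).
\end{align*}
% At x = x^* with \bar f = f^*, both f - \bar f and c vanish, and g^* = A^* \lambda^*, so
\begin{equation*}
\tfrac12 \nabla^2 [\phi(x^*, f^*)]^2
  = g^* g^{*T} + A^* A^{*T}
  = A^* \left( I + \lambda^* \lambda^{*T} \right) A^{*T},
\end{equation*}
% an n x n matrix of rank at most m, hence singular whenever m < n.
```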
Another penalty function which relates the ideas in (12.5.1) to those contained
in multiplier penalty functions (Section 12.2) and $L_\infty$ exact penalty functions
(Section 14.3) is given by Bandler and Charalambous (1972). The function

$$\phi(x, p, \alpha, \bar f) = \Bigl( \sum_{i=0}^{m} f_i^p \Bigr)^{1/p}, \qquad \begin{array}{l} f_0 = f(x) - \bar f \\ f_i = f(x) - \bar f - \alpha_i c_i(x), \quad i = 1, 2, \ldots, m \end{array} \qquad (12.5.2)$$

is defined, where $\bar f < f^*$ and $f_i > 0\ \forall\, i \ge 0$. There are many ways to force the
minimizer $x(p, \alpha, \bar f)$ of this function to converge to $x^*$. In a neighbourhood of $x^*$
and in the limit $p \to \infty$, the function becomes equivalent to the exact $L_\infty$ penalty
function. Thus one possible mode of application is a Polya-type algorithm in which
$\bar f$ and $\alpha$ are fixed, and a control sequence $\{p^{(k)}\} \to \infty$ is used. Another possibility is
to fix p and $\alpha$ and to choose a sequence of control parameters $\bar f^{(k)} \to f^*$ as in
(12.5.1). Finally p and $\bar f$ can remain fixed, in which case a sequence
$\alpha^{(k)} \to (m + 1)\lambda^*$ of scaled Lagrange multiplier estimates can be used as control
parameters. Charalambous (1977) shows how a suitable updating formula for the $\alpha^{(k)}$
parameters can be determined.

Another attractive idea is to attempt to define an exact penalty function $\phi(x)$
which is minimized locally by the solution $x^*$ to (12.1.1) or (12.1.2). A simple
way to do this is described in Section 14.3 and gives rise to a non-differentiable
function. This requires special purpose techniques to solve the unconstrained
minimization problem, but is a promising approach, and one particular algorithm
is described in Section 14.5. It is also possible to determine a smooth exact penalty
function. The key idea here (Fletcher, 1973) is to use an augmented Lagrangian
function in which $\lambda$ is not an independent vector but is a function $\lambda(x)$ determined
by a finite calculation. In particular the choice $\lambda(x) = A^+ g|_x$, obtained by solving
the over-determined least squares problem $A\lambda = g$, gives rise to the penalty function

$$\phi(x) = f(x) - \lambda(x)^T c(x) + c(x)^T S c(x). \qquad (12.5.3)$$

An obvious choice for S is $\tfrac12 \sigma I$, in which case the penalty function is exact for any
value of $\sigma$ above a certain threshold. In fact the choice $S = \sigma A^+ A^{+T}$ turns out to be
more significant, since it is then possible to rearrange $\phi(x)$ to give

$$\phi(x) = f(x) - \pi(x)^T c(x) \qquad (12.5.4)$$

where

$$\pi(x) = A^+ (g - \sigma A^{+T} c)\,|_x$$

is the Lagrange multiplier vector of the subproblem

$$\begin{array}{ll} \text{minimize}_{\delta} & \tfrac12 \sigma \delta^T \delta + g^T \delta \\ \text{subject to} & A^T \delta + c = 0 \end{array} \qquad (12.5.5)$$

where g, A, and c are evaluated at x. In fact the solution $\delta(x)$ of this system is discarded,
but the multipliers $\pi(x)$ are substituted into (12.5.4) to define $\phi(x)$. The
constraints in (12.5.5) are linearizations of the non-linear constraints about x
(see (12.3.11)) so a generalization to solve inequality constraints is immediate.
The resulting function $\phi(x)$ can be minimized by any suitable smooth unconstrained
minimization technique. The main difficulty with this approach is that the calculation
of $\lambda(x)$ or $\pi(x)$ requires first derivatives of f and c. Thus $\nabla\phi$ requires second
derivatives of f and c, which are often not available. However, if these derivatives
are available, then an O(h) estimate of $\nabla^2 \phi(x^*)$ can be made, which ensures second
order convergence, and hence a Newton-like minimization method is very practicable
in these circumstances.
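
The claim that the multiplier vector of subproblem (12.5.5) has the closed form quoted with (12.5.4) can be verified in a few lines. The sketch below assumes that A has full column rank, so that $A^+ = (A^T A)^{-1} A^T$, and reads (12.5.5) as minimizing $\tfrac12 \sigma \delta^T \delta + g^T \delta$ subject to $A^T \delta + c = 0$:

```latex
% Lagrangian of (12.5.5) with multipliers \pi:
%   \tfrac12 \sigma \delta^T \delta + g^T \delta - \pi^T (A^T \delta + c)
\begin{align*}
\sigma \delta + g - A \pi &= 0
  &&\Rightarrow\quad \delta = (A \pi - g)/\sigma, \\
A^T \delta + c &= 0
  &&\Rightarrow\quad A^T A\, \pi = A^T g - \sigma c, \\
\pi &= (A^T A)^{-1} (A^T g - \sigma c)
  = A^+ \left( g - \sigma A^{+T} c \right),
\end{align*}
% using A^+ A^{+T} = (A^T A)^{-1}. Substituting into f - \pi^T c gives
\begin{equation*}
f - \pi^T c = f - (A^+ g)^T c + \sigma\, c^T A^+ A^{+T} c ,
\end{equation*}
% which is (12.5.3) with \lambda(x) = A^+ g and S = \sigma A^+ A^{+T}.
```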

Another idea which occurs frequently is to make linearizations of the non-linear 
functions which arise in the problem, and then solve LP or QP subproblems. Some 
early methods such as the cutting plane method and MAP make linear Taylor series
approximations like (7.1.3) about the current iterate $x^{(k)}$ to both the objective and
constraint functions, and the resulting LP (possibly including some step-length
restriction) is solved to determine $x^{(k+1)}$. This may be satisfactory when the solution
is at a vertex of the feasible region, but in general does not model the effect
of curvature. Thus the convergence properties suffer and there can be numerical 
difficulties. The techniques of Section 12.3 are therefore recommended, since they 
avoid all these difficulties albeit at the expense of solving a QP subproblem. Another 
technique which uses linear approximations is separable programming (see Beale 
(1970) for instance) in which the objective and constraint functions are separable. 
A separable function is one which can be written as 


$$f(x) = \sum_i f_i(z_i) \qquad (12.5.6)$$


where the scalar quantities z; are either linear functions of x or other separable 
functions. Each separable function is replaced by a piecewise linear approximation 
whose knots are at fixed values of the variable z;. It is then possible to reduce the 
problem to something like a linear programming problem, but with differences in 
the rules for changing the basis. The accuracy of the solution clearly depends on 
that of the piecewise linear approximations. In fact separable programming is most 
useful when a rough estimate of the solution is satisfactory and when the problem 
is substantially linear and sparse, such as in business or economic models. 
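
The piecewise linear replacement of one separable term can be illustrated with a small sketch. The helper name `piecewise_linear`, the choice $f_i(z) = z^2$ and the knots are all assumptions made for illustration; they are not taken from any separable programming code.

```python
def piecewise_linear(f, knots):
    """Return the piecewise linear interpolant of f with fixed, increasing knots."""
    values = [f(z) for z in knots]

    def approx(z):
        if not knots[0] <= z <= knots[-1]:
            raise ValueError("z lies outside the range of the knots")
        for j in range(len(knots) - 1):
            z0, z1 = knots[j], knots[j + 1]
            if z0 <= z <= z1:
                t = (z - z0) / (z1 - z0)
                # convex combination of the values at the two neighbouring knots
                return (1.0 - t) * values[j] + t * values[j + 1]

    return approx


square = lambda z: z * z
approx = piecewise_linear(square, [0.0, 1.0, 2.0, 3.0])
for z in (0.5, 1.5, 2.0):
    print(z, square(z), approx(z))
```

Between knots the interpolant overestimates this convex function, and the gap shrinks as knots are added, which is the accuracy dependence referred to above.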

Another type of algorithm which solves QP subproblems is given by Murray
(1969). This is motivated by trying to avoid the ill-conditioning associated with
$\sigma \to \infty$ in the penalty function (12.1.3). A simple interpretation of the method
arises from the fact that the minimizer $x(\sigma)$ of (12.1.3) equivalently solves the
constrained problem

$$\begin{array}{ll} \text{minimize}_{x} & f(x) \\ \text{subject to} & c(x) = c(x(\sigma)). \end{array} \qquad (12.5.7)$$

A sequence of values $\sigma^{(k)} \to \infty$ is chosen although this is done as the algorithm proceeds.
Murray (1969) does not give details but suggests that the rate of increase
should depend on how well the solution is approximated. The quadratic approximation
(12.3.10) to f(x) and the linear approximation (12.3.11) to c(x) are made,
and using the asymptotic result (12.1.10) the subproblem

$$\begin{array}{ll} \text{minimize}_{\delta} & q^{(k)}(\delta) \\ \text{subject to} & l^{(k)}(\delta) = \dfrac{\sigma^{(k)} c^{(k)}}{\sigma^{(k+1)}} \end{array} \qquad (12.5.8)$$

is defined. This is solved to determine a search direction along which $\phi(x, \sigma^{(k+1)})$
is minimized. The line search offsets the apparent disadvantage that the 'best'
choice $\sigma^{(k+1)} = \infty$ is not usually made. The method can be improved by introducing
estimates of Lagrange multipliers and such a method is described by Biggs (1975).
This method has been used widely in practice and has performed well, but it turns
out to be similar to the methods of Section 12.3 and a detailed discussion of the
method will not be given here. A convergence proof for the method is given by
Biggs (1978) but seems to rely strongly on the requirement of uniform independence
of the columns of the Jacobian matrix which is an unrealistic assumption.


Questions for Chapter 12 


1. Show that if the penalty function (12.1.3) is applied to a quadratic programming
problem with equality constraints, then $\phi(x, \sigma)$ is a quadratic function of
x. State the resulting Hessian matrix $\nabla^2 \phi$.

2. Consider using the penalty function (12.1.3) to solve the problem:
min $-x_1 x_2 x_3$ subject to $72 - x_1 - 2x_2 - 2x_3 = 0$. Verify that the explicit
expression for $x(\sigma)$ given by $x_2 = x_3 = 24/(1 + \sqrt{1 - 8/\sigma})$, $x_1 = 2x_2$, satisfies
$\nabla\phi(x(\sigma), \sigma) = 0$. Verify also that $x(\sigma) \to x^*$ as $\sigma \to \infty$. Find $x(\sigma)$ when $\sigma = 9$
and verify that $\nabla^2 \phi(x(9), 9)$ is positive definite.

3. Consider the problem (12.1.5). For the penalty function (12.1.3) show that
the elements of the minimizer $x(\sigma)$ satisfy the equations $x_1 = x_2$ and
$2x_1^3 - x_1 - 1/(2\sigma) = 0$. Show that as $\sigma \to \infty$, $x_1 = \tfrac12 \sqrt{2} + a/\sigma + O(1/\sigma^2)$, and
find a. Consider now the problem: min $-x_1 - x_2$ subject to $1 - x_1^2 - x_2^2 \ge 0$
and $x_2 - x_1^2 \ge 0$. Show how the penalty function may be modified to solve
this problem. For what values of $\sigma$ do the minimizers of the two penalty
functions agree?

4. Show that if second order conditions (9.3.5) hold, and if rank $A^* = m$, then the
Lagrangian matrix

$$\begin{bmatrix} W^* & -A^* \\ -A^{*T} & 0 \end{bmatrix}$$

is non-singular. Let

$$\begin{bmatrix} W^* & -A^* \\ -A^{*T} & 0 \end{bmatrix} \begin{pmatrix} s \\ t \end{pmatrix} = 0.$$

Show that rank $A^* = m$ implies t = 0. If $s \ne 0$ show that s satisfies (9.3.4) and
$s^T W^* s = 0$ which contradicts (9.3.5). Hence conclude that the Lagrangian
matrix is non-singular.

5. Let $c^{(k)} \to c^*$, $A^{(k)} \to A^*$ (A is n × m, n > m) and let rank $A^* = m$. Use the
Rayleigh quotient result ($u^T M u \ge \lambda_n u^T u\ \forall\, u$) and the definition of a singular
value $\sigma_i$ of A (square root of an eigenvalue $\lambda_i$ of $A^T A$) to show that if
$0 < \beta < \sigma_m^*$ then $\|A^{(k)} c^{(k)}\|_2 \ge \beta \|c^{(k)}\|_2$ for all k sufficiently large. Hence
establish (12.1.22).

6. For the inverse barrier function (12.1.24) show that estimates of Lagrange
multipliers analogous to (12.1.8) are given by $\lambda_i^{(k)} = r^{(k)} / c_i^{(k)2}$. Hence show
that $\lambda_i^{(k)} \to 0$ for an inactive constraint and that any limit point $x^*$, $\lambda^*$ is a KT
point. Show that $h^{(k)}$ has asymptotic behaviour $O(r^{(k)1/2})$ which can be used
in an acceleration technique (see SUMT of Fiacco and McCormick, 1968). Show
also that o(r) estimates of $f^*$ can be made in a similar way to (12.1.12).

7. For the logarithmic barrier function (12.1.25) show that estimates of Lagrange
multipliers are given by $\lambda_i^{(k)} = r^{(k)} / c_i^{(k)}$, that $x^*$, $\lambda^*$ is a KT point, that the
asymptotic behaviour of $h^{(k)}$ is O(r), and that o(r) estimates of $f^*$ can be
made.

8. Investigate the Hessian matrices of the barrier functions (12.1.24) and (12.1.25)
at $x(r^{(k)})$ and demonstrate ill-conditioning as $r^{(k)} \to 0$ if the number of active
constraints is between 1 and n − 1. What happens when $\lambda_i^* = 0$ for an active
constraint?

9. When the penalty function (12.1.23) is applied to the problem

minimize $-x_1 - x_2 + x_3$
subject to $0 \le x_3 \le 1$
$x_1^3 + x_3 \le 1$
$x_1^2 + x_2^2 + x_3^2 \le 1$

the following data are obtained. Use this to estimate the optimum solution and
multipliers, together with the active constraints, and give some indication of
the accuracy which is achieved.

k   σ^(k)   x_1(σ^(k))   x_2(σ^(k))   x_3(σ^(k))
1   1       0.834379     0.834379     −0.454846
2   10      0.728324     0.728324     −0.087920
3   100     0.709557     0.709557     −0.009864
4   1000    0.707356     0.707356     −0.001017

10. Consider the problem

minimize $\exp(x_1 x_2 x_3 x_4 x_5)$
subject to $x_1^2 + x_2^2 + x_3^2 + x_4^2 + x_5^2 = 10$

given by Powell (1969). Investigate how accurately a local solution $x^*$ and
multipliers $\lambda^*$ can be obtained, using the penalty function (12.1.3) and the
sequence $\{\sigma^{(k)}\}$ = 0.01, 0.1, 1, 10, 100. Use a quasi-Newton subroutine which
requires no derivatives. A suitable initial approximation to $x(\sigma^{(1)})$ is the vector
$(-2, 2, 2, -1, -1)^T$, and it is sufficient to obtain each element of $x(\sigma^{(k)})$
correct to three decimal places.

11. Consider using the multiplier penalty function (12.2.3) to solve the problem
in Question 12.2. Assume that the controlling parameters $\lambda^{(1)} = 0$ and $\sigma = 9$
are chosen initially, in which case the minimizer x(0, 9) is the same as the
minimizer x(9) which has been determined in Question 12.2. Write down the
new parameter $\lambda^{(2)}$ which would be given by using formulae (12.2.13) and
(12.2.15). Which formula gives the better estimate of $\lambda^*$?

12. Use the Sherman–Morrison formula (Volume 1, Question 3.13) to show that
if S is non-singular, then

$$\begin{bmatrix} W + A S A^T & -A \\ -A^T & 0 \end{bmatrix}^{-1} = \begin{bmatrix} W & -A \\ -A^T & 0 \end{bmatrix}^{-1} - \begin{bmatrix} 0 & 0 \\ 0 & S \end{bmatrix}.$$

Hence use (10.2.6) to show that, in the notation of (12.2.6),

$$(A^T W_\sigma^{-1} A)^{-1} = (A^T W^{-1} A)^{-1} + S.$$

Let $\sigma'$ be fixed so that $W_{\sigma'}$ is positive definite. Deduce that

$$(A^T W_\sigma^{-1} A)^{-1} = S + (A^T W_{\sigma'}^{-1} A)^{-1} - S' = S + O(1)$$

which is (12.2.14).

13. Consider any iteration function

$$\lambda^{(k+1)} = \phi(\lambda^{(k)}) = \lambda^{(k)} - M c^{(k)}$$

analogous to (12.2.13) or (12.2.15) for use with a multiplier penalty function.
Show that

$$\nabla \phi^T(\lambda^*) = I - (A^{*T} W_\sigma^{*-1} A^*) M^T.$$

Hence use the result of Question 12.12 and theorem 6.2.2 (Volume 1) to show
that the Powell–Hestenes formula (12.2.15) converges at a linear rate, which
can be made arbitrarily rapid by making S sufficiently large.

14. Show that the vector $x(\lambda)$ which minimizes (12.2.1) equivalently solves the
problem

minimize f(x) (over x)
subject to $c(x) = c(x(\lambda))$.

If $x(\lambda)$ minimizes (12.2.18) show that the equivalent problem is

minimize f(x) (over x)
subject to $c_i(x) \ge \min(c_i(x(\lambda)), \theta_i)$, $i = 1, 2, \ldots, m$

(Fletcher, 1975). These problems are neighbouring problems to (12.1.1) and
(12.1.2) respectively.

15. Consider using the penalty function

$$\phi(x, \lambda, \sigma) = f(x) + \sum_i \sigma_i \exp(-\lambda_i c_i(x) / \sigma_i)$$

where $\sigma_i > 0$, to solve problem (12.1.1). Assume that a local solution $x^*$ of
this problem satisfies the second order sufficient conditions (9.3.5) and (9.3.4).
Show that if control parameters $\lambda = \lambda^*$ and any $\sigma$ are chosen, then $x^*$ is a
stationary point of $\phi(x, \lambda^*, \sigma)$. Show also that if, for all i, both $\lambda_i^* \ne 0$ and $\sigma_i$
is sufficiently small, then $x^*$ is a local minimizer of $\phi(x, \lambda^*, \sigma)$. (You may use
without proof the lemma that if A is symmetric, if D is diagonal with sufficiently
large positive elements, and if $v^T A v > 0$ for all v (≠ 0) such that
$B^T v = 0$, then $A + B D B^T$ is positive definite.)

What can be said about the case when some $\lambda_i^* = 0$? To what extent is this
penalty function comparable to the Powell–Hestenes function?

If $x(\lambda)$ is the unique global minimizer of $\phi(x, \lambda, \sigma)$ for all $\lambda$, where $\sigma$ is
constant, and if $x^* = x(\lambda^*)$, show that $\lambda^*$ maximizes the function $\psi(\lambda) \triangleq
\phi(x(\lambda), \lambda, \sigma)$. What are the practical implications of this result?

Solve the problem in Question 12.19 by generalized elimination of variables
using the Newton method as shown in Table 12.4.1. Compare the sequence of
estimates and the amount of work required with that in Question 12.19.

Assume that second order sufficient conditions (9.3.5) and (9.3.4) hold at
$x^*$, $\lambda^*$. Let $(x^{(k)}, \lambda^{(k)}) \to (x^*, \lambda^*)$ and let a vector $s^{(k)}$ ($\|s^{(k)}\|_2 = 1$) exist
such that $s^{(k)T} W^{(k)} s^{(k)} \le 0\ \forall\, k$. Show by continuity that (9.3.5) and (9.3.4)
are contradicted and hence that the subproblem (12.3.9) is well determined
for $x^{(k)}$, $\lambda^{(k)}$ sufficiently close to $x^*$, $\lambda^*$.


Consider generalizing theorem 12.3.1 to apply to the inequality constraint
problem (12.1.2). If strict complementarity holds at $x^*$, $\lambda^*$, show by virtue of
the implicit function theorem that a non-singular Lagrangian matrix (12.3.14)
implies that the solution of (12.3.5) changes smoothly in some neighbourhood
of $x^*$, $\lambda^*$. Hence show that the solution of the equality constraint problem
also solves the inequality problem, so that constraints $c_i(x)$, $i \notin \mathcal{A}^*$, can be
ignored. If strict complementarity does not hold, assume that the vectors
$a_i^*$, $i \in \mathcal{A}^*$ are independent and that the second order sufficient conditions
of theorem 9.3.2 hold. Deduce that the Lagrangian matrix (12.3.14) is non-singular
for any subset of active constraints $\mathcal{A}$ for which $\mathcal{A}^*_+ \subseteq \mathcal{A} \subseteq \mathcal{A}^*$. Show
that there is a neighbourhood of $x^*$, $\lambda^*$ in which the active constraints
obtained from solving (12.3.13) satisfy this condition, so that a result like
(12.3.16) holds for any such $\mathcal{A}$, and hence for $\mathcal{A}^*$. The rest of the theorem
then follows.


Consider solving the problem: min $x_1 + x_2$ subject to $x_2 \ge x_1^2$ by the
Lagrange–Newton method (12.3.12) from $x^{(1)} = 0$. Why does the method
fail if $\lambda^{(1)} = 0$? Verify that rapid convergence occurs if $\lambda^{(1)} = 1$ is chosen.


Chapter 13 


Other Optimization Problems 


13.1 Integer Programming 


In this chapter two other types of problem are studied which have the feature that 
they can be reduced to the solution of standard problems described in other 
chapters. Integer programming is the study of optimization problems in which 
some of the variables are required to take integer values. The most obvious example 
of this is the number of men, machines, or components, etc., required in some 
system or schedule. It also extends to cover such things as transformer settings, 
steel stock sizes, etc., which occur in a fixed ordered set of discrete values, but 
which may be neither integers, nor equally spaced. Most combinatorial problems 
can also be formulated as integer programming problems; this often requires the 
introduction of a zero—one variable which takes either of the values 0 or 1 
only. Zero—one variables are also required when the model is dependent on the 
outcome of some decision on whether or not to take a certain course of action. 
Certain more complicated kinds of condition can also be handled by the methods 
of integer programming through the introduction of special ordered sets; see Beale 
(1978) for instance. There are many special types of integer programming problems 
but this section covers the fairly general category of mixed integer programming 
(both integer and continuous variables allowed, and any continuous objective 
and constraint functions as in (7.1.1)). The special case is also considered of mixed 
integer LP in which some integer variables occur in what is otherwise an LP 
problem. The branch and bound method is described for solving mixed integer 
programming problems, together with additional special features which apply to 
mixed integer LP. There are many special cases of integer programming problems 
and it is the case that the branch and bound method is not the most suitable in 
every case. Nonetheless this method is the one method which has a claim to be of 
wide generality and yet reasonably efficient. It solves a sequence of subproblems 
which are continuous problems of type (7.1.1) and is therefore well related to other 
material in this book. Other methods of less general applicability but which can be 
useful in certain special cases are reviewed by Beale (1978). 

Integer programming problems are usually much more complicated and expensive
to solve than the corresponding continuous problem on account of the discrete
nature of the variables and the combinatorial number of feasible solutions
which thus exists. A tongue-in-cheek possibility is the Dantzig two-phase method.
The first phase is to try to convince the user that he or she does not wish to solve an
integer programming problem at all! Otherwise the continuous problem is solved
and the minimizer is rounded off to the nearest integer. This may seem flippant 
but in fact is how many integer programming problems are treated in practice. 
It is important to realize that this approach can fail badly. There is of course no 
guarantee that a good solution can be obtained in this way, even by examining all 
integer points in some neighbourhood of the continuous solution. Even worse is the 
likelihood that the rounded-off points will be infeasible with respect to the con- 
tinuous constraints in the problem (for example problem (13.1.5) below), so that 
the solution has no value. Nonetheless the idea of rounding up or down does have 
an important part to play when used in a systematic manner, and indeed is one 
main feature of the branch and bound method. 
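The danger of naive rounding can be checked numerically on problem (13.1.5), which is stated later in this section. The sketch below is only an illustration: it assumes that the continuous minimizer is the vertex at which both constraints are active (this agrees with problem 1 in Figure 13.1.3), and the helper name `feasible` is introduced here and is not part of the text.

```python
import itertools
import math

def feasible(x1, x2):
    """Continuous constraints of problem (13.1.5)."""
    return 66 * x1 + 14 * x2 >= 1430 and -82 * x1 + 28 * x2 >= 1306

# Vertex where both constraints are active (Cramer's rule on the 2x2 system).
det = 66 * 28 - 14 * (-82)                # 2996
x1 = (1430 * 28 - 14 * 1306) / det        # about 7.26
x2 = (66 * 1306 + 82 * 1430) / det        # about 67.91

# All four points obtained by rounding each variable up or down.
rounded = list(itertools.product((math.floor(x1), math.ceil(x1)),
                                 (math.floor(x2), math.ceil(x2))))
status = {pt: feasible(*pt) for pt in rounded}
for pt, ok in status.items():
    print(pt, "feasible" if ok else "infeasible")
```

Under these assumptions every one of the four rounded points violates a constraint, so rounding alone yields nothing usable here.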

To develop the branch and bound method therefore, the problem is to find the
solution $x^*$ of the problem

$$P_I:\quad \text{minimize } f(x) \quad \text{subject to } x \in R,\ x_i \text{ integer } \forall\, i \in I \tag{13.1.1}$$


where $I$ is the set of integer variables and $R$ is the (closed) feasible region of the
continuous problem

$$P:\quad \text{minimize } f(x) \quad \text{subject to } x \in R, \tag{13.1.2}$$


which could be (7.1.1) for example. Let the minimizer $x'$ of $P$ exist: if it is feasible
in $P_I$, then it solves $P_I$. If not, then there exists an $i \in I$ for which $x_i'$ is not an
integer. In this case two problems can be defined by branching on variable $x_i$ in
problem $P$, giving

$$P^-:\quad \text{minimize } f(x) \quad \text{subject to } x \in R,\ x_i \le [x_i'] \tag{13.1.3}$$

and

$$P^+:\quad \text{minimize } f(x) \quad \text{subject to } x \in R,\ x_i \ge [x_i'] + 1 \tag{13.1.4}$$

where $[x]$ means the largest integer not greater than $x$. Also define $P_I^-$ as problem
(13.1.3) together with the integrality constraints $x_i$, $i \in I$, integer, and $P_I^+$ likewise.
Two observations which follow are that $x^*$ is feasible in either $P^-$ or $P^+$ but not
both, in which case it solves $P_I^-$ or $P_I^+$. Also any feasible point in $P_I^-$ or $P_I^+$ is
feasible in $P_I$.

It is usually possible to repeat the branching process by branching on $P^-$ and
$P^+$, and again on the resulting problems, so as to generate a tree structure (Figure
13.1.1). Each node corresponds to a continuous optimization problem, the root is
the problem $P$, and the nodes are connected by the branches defined above. There
are two special cases where no branching is possible at any given node: one is where
the solution to the corresponding problem is integer feasible (an integer feasible
node) and the other where the problem has no feasible point (an infeasible node).
Otherwise each node is a parent problem (a parent node) and gives rise to two
branched problems. If the feasible region is bounded then the tree is finite and each
path through the tree ends in an integer feasible node or an infeasible node.
Assuming that the solution $x^*$ of $P_I$ exists, then it is feasible along just
one path through the tree (the broken line, say, in Figure 13.1.1) and is feasible
in, and therefore solves, the integer feasible problem which must exist at the end of
the path. The solution of every integer feasible problem is feasible in $P_I$ and so the
required solution vector $x^*$ is that solution of an integer feasible problem which has
least objective function value. Thus the aim of the branch and bound method is to
seek out integer feasible nodes by exploring the tree in order to find the one with
the least value. Often in the tree there are parent nodes
whose solution violates more than one integrality constraint, so in fact the tree is
not uniquely defined until the choice of branching variable is defined. One anomalous
(although unlikely) possibility is that $x^*$ may be well defined, but the continuous
problem $P$ may be unbounded ($f(x) \to -\infty$). This possibility is easily
excluded in most practical cases by adding bounds to the variables so as to make
$R$ bounded.

Figure 13.1.1 Tree structure in the branch and bound method

To find the type of each node in the tree requires the solution of a continuous
optimization problem. Also the number of nodes grows exponentially with the
number of variables and may not even be finite. Therefore to examine the whole
tree is usually impossibly expensive and so the branch and bound method attempts
to find the solution of $P_I$ by making only a partial search of the tree. Assuming
that a global solution of each problem can be calculated, adding the branching
constraint makes the solution of the parent problem infeasible in the branched
problem, and so the optimum objective function value of the parent problem is
a lower bound for that of the branched problem. Assume that some but not all of
the problems in the tree have been solved, and let $f_i$ be the best optimum objective
function value which has been obtained by solving the problem at an integer feasible
node $i$. The solution of this problem is feasible in $P_I$ so $f^* \le f_i$. Now if $f_j$ is the
value associated with any parent node $j$, and $f_j \ge f_i$, then branches from node $j$
cannot reduce $f_j$ and so they cannot give a better value than $f_i$. Hence these
branches need not be examined.
This is the main principle which enables the solution to be obtained by making only
a partial search of the tree.

Nonetheless because of the size of the tree, the branch and bound method only 
becomes effective if the partial search strategy is very carefully chosen. Two import- 
ant decisions have to be made: 














(a) which problem or node should be solved next, and 
(b) on which variable should a branch be made. 


The following would seem to be an up-to-date assessment of the state of the art 
(see Beale (1978), for example, for more detailed references). An early method 
favoured the use of a stack (a last-in first-out list or data structure). This algorithm 
keeps a stack of unsolved problems, each with a lower bound L on its minimum 
objective function value. This bound can be the value of the parent problem, 
although in linear programming problems a better bound can be obtained as 
described below. From those integer feasible nodes which have been explored, the
current best integer feasible solution $\hat{x}$ with value $\hat{f}$ is stored. Initially
$\hat{f} = \infty$, the continuous problem is put in the stack with $L = -\infty$, and the
algorithm proceeds as follows.

(i) If no problem is in the stack, finish with $\hat{x}$, $\hat{f}$ as $x^*$, $f^*$; otherwise take the
top problem from the stack.

(ii) If $L \ge \hat{f}$ reject the problem and go to (i).

(iii) Try to solve the problem: if no feasible point exists reject the problem and
go to (i).

(iv) Let the solution be $x'$ with value $f'$: if $f' \ge \hat{f}$ reject the problem and go
to (i).

(v) If $x'$ is integer feasible then set $\hat{x} = x'$, $\hat{f} = f'$ and go to (i).

(vi) Select an integer variable $i$ such that $[x_i'] < x_i'$, create two new problems
by branching on $x_i$, place these on the stack with lower bound $L = f'$
(or a tighter lower bound derived from $f'$), and go to (i).


The rationale for these steps follows from the above discussion: in step (iv) any
branched problems must have $f \ge f' \ge \hat{f} \ge f^*$ so cannot improve the optimum,
so the node is rejected. Likewise in step (ii) where $f \ge L \ge \hat{f} \ge f^*$. Step (iii)
corresponds to finding an infeasible node and step (v) to an integer feasible node.
Step (vi) is the branching operation, and step (i) corresponds to having examined or
excluded from consideration all nodes in the tree, in which case the solution at the
best integer feasible node gives $x^*$. The effect of this stack method is to follow a
single path deep into the tree to an integer feasible node. Then the algorithm works
back, either rejecting problems or creating further sub-trees by branching and
perhaps updating the best integer feasible solution $\hat{x}$ and $\hat{f}$. This is illustrated
in Figure 13.1.2.

Figure 13.1.2 Progress of the stack method


So far the description of the algorithm is incomplete in that rules are required 
for choosing the branching variable if more than one possibility exists, and for 
deciding which branched problem goes on the stack first. To make this decision 
requires estimates $e_i^+$ and $e_i^-$ ($\ge 0$) of the increase in $f$ when adding the branching
constraints $x_i \ge [x_i'] + 1$ and $x_i \le [x_i']$ respectively. These estimates can be made
using information about first derivatives (and also second derivatives in other than
LP problems) in a way which is described below. Then the rule is to choose $i$ to
solve $\max_i \max(e_i^+, e_i^-)$ and place the branched problem corresponding to
$\max(e_i^+, e_i^-)$ on the stack first. Essentially the worst case is placed on the stack
first, and so its solution is deferred until later. The motivation for this rule is
the hope that, being the worst case, this problem will have a high value of $f$ and
so can be rejected at a later stage with little or no branching. A numerical example
of the stack method applied to the integer LP problem

$$\begin{array}{ll} \text{minimize} & x_1 + 10x_2 \\ \text{subject to} & 66x_1 + 14x_2 \ge 1430 \\ & -82x_1 + 28x_2 \ge 1306, \quad x_1, x_2 \text{ integer} \end{array} \tag{13.1.5}$$


is illustrated in Figure 13.1.3. Some shading on the grid shows the effect of adding 
extra constraints to the feasible region, and numbered points indicate solutions to 
the subproblems. The solution tree is shown, and the typical progress described in 
Figure 13.1.2 can be observed. After branching on problem 7, the stack contains 
problems 9, 8, 6, 4, 2 in top to bottom order. Problem 9 gives rise to an integer 
feasible solution, which enables the remaining problems to be rejected in turn with- 
out further branching. Notice that it may of course be necessary to branch on the 
same variable more than once in the tree. 
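The stack method of steps (i) to (vi) can be sketched in a few lines for problem (13.1.5). This is a minimal illustration under stated assumptions, not the implementation behind Figure 13.1.3: the function names are hypothetical, the LP subproblems are solved by a crude enumeration of constraint intersections (adequate only for a two-variable problem whose minimizer lies at a vertex), and the branching variable is simply the first fractional one, with the upper branch explored first, rather than being chosen from the estimates $e_i^+$, $e_i^-$. The nodes visited therefore need not coincide with those in the figure.

```python
import math
from itertools import combinations

def solve_lp_2d(c, cons, tol=1e-7):
    """Minimize c.x over {x in R^2 : a.x >= b for every (a, b) in cons}.

    Hypothetical helper: tries every intersection of two constraints and keeps
    the best feasible one. Assumes a finite minimizer exists at a vertex.
    Returns (x, f), or None if no feasible vertex is found.
    """
    best = None
    for (a1, b1), (a2, b2) in combinations(cons, 2):
        det = a1[0] * a2[1] - a1[1] * a2[0]
        if abs(det) < 1e-12:
            continue                          # parallel lines give no vertex
        x = ((b1 * a2[1] - a1[1] * b2) / det, (a1[0] * b2 - b1 * a2[0]) / det)
        if all(a[0] * x[0] + a[1] * x[1] >= b - tol for a, b in cons):
            f = c[0] * x[0] + c[1] * x[1]
            if best is None or f < best[1]:
                best = (x, f)
    return best

def branch_and_bound(c, cons, int_tol=1e-6):
    best_x, best_f = None, math.inf           # incumbent, initially infinite
    stack = [(-math.inf, list(cons))]         # continuous problem, L = -inf
    while stack:
        L, prob = stack.pop()                 # (i) take the top problem
        if L >= best_f:                       # (ii) bound test
            continue
        sol = solve_lp_2d(c, prob)            # (iii) solve, reject if infeasible
        if sol is None:
            continue
        x, f = sol
        if f >= best_f:                       # (iv) cannot improve the incumbent
            continue
        frac = [i for i in (0, 1) if abs(x[i] - round(x[i])) > int_tol]
        if not frac:                          # (v) integer feasible: new incumbent
            best_x, best_f = (round(x[0]), round(x[1])), f
            continue
        i = frac[0]                           # (vi) simplest choice of variable
        k = math.floor(x[i])
        unit = (1.0, 0.0) if i == 0 else (0.0, 1.0)
        down = prob + [((-unit[0], -unit[1]), -k)]   # x_i <= k
        up = prob + [(unit, k + 1)]                   # x_i >= k + 1
        stack.append((f, down))
        stack.append((f, up))                 # on top, so explored first
    return best_x, best_f

c = (1.0, 10.0)
cons = [((66.0, 14.0), 1430.0), ((-82.0, 28.0), 1306.0)]
x_best, f_best = branch_and_bound(c, cons)
print(x_best, f_best)
```

With these choices the search ends at the same integer solution as problem 9 in Figure 13.1.3. On an unbounded feasible region a depth-first rule of this kind is not guaranteed to terminate in general, which is the point made in Question 1 at the end of the chapter.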

Figure 13.1.3 Solution of (13.1.5) by the branch and bound method

List of problems

No.   $x_1$    $x_2$    $f$
1     7.26     67.91    686.4
2     $x_2 \le 67$, not feasible
3     7.24     68       687.2
4     8        70.07    708.7
5     7        69.14    698.4
6     $x_2 \le 69$, not feasible
7     6.82     70       706.8
8     6        73.86    744.6
9     7        70       707

In practice it has become apparent that the stack method contains some good
features but can also be inefficient. The idea of following a path deep into the tree
in order to find an integer feasible solution works well in that it rapidly determines
a good value of $\hat{f}$ which can be used to reject other branches. This feature also gives
a feasible solution if time does not permit the search to be completed. Furthermore
it is very efficient in the amount of storage required to keep details of unsolved
pending problems in the tree. However the process of back-tracking from an integer
feasible node to neighbouring nodes in the tree can be unrewarding. Once a path can
be followed
no further it is preferable to select the best of all pending nodes to explore next, 
rather than that on top of the stack. To do this, it is convenient to associate an
estimate $E$ of its solution value with each unsolved problem which is created by
branching. $E$ is conveniently defined by $f' + e_i^+$ and $f' + e_i^-$ respectively, where
$f'$ is the value of the parent problem and $e_i^+$, $e_i^-$ are the estimates referred to
above. A path starting at the pending node with least $E$ is then followed through
the tree. Rules of this type are reviewed by Beale (1978) and Breu and Burdet
(1974). Another useful idea is to have the user supply a cut-off value, above which
solutions are of no interest. This can be used initially in place of $\hat{f} = \infty$ and may
prevent deep exploration of unfavourable paths. A further disadvantage of the stack
method is that the ‘worst case’ rule for choosing the branching variable is often
inadequate. Present algorithms prefer to branch on the ‘most important variable’,
that is the one which is likely to be most sensitive in determining the solution. This
can be determined from user-assigned ‘priorities’ on the variables (often zero–one
variables should be given high priority). To choose between variables of equal
priority the variable which solves $\max_i \min(e_i^-, e_i^+)$ is selected, that is the one for
which the least estimate of the increase in $f$ is largest. This quantifies the definition
of the most important variable. Other possible selection rules are discussed by Breu 
and Burdet (1974). 

The branch and bound strategies depend critically on making good estimates $e_i^+$
and $e_i^-$ of the increases in $f$ caused by adding either of the branching constraints.
Such estimates are readily made for an integer LP problem and are most easily
described in terms of the active set method (Section 8.3). Let $x'$ be the solution of
a parent node, value $f'$. Then the directions of feasible edges at $x'$ are the vectors
$s_q$ where $A^{-T} = [s_1, s_2, \ldots, s_n]$. The slope of $f$ along $s_q$ is given by the multiplier
$\lambda_q$ ($\ge 0$). Consider finding $e_i^+$. A search along $s_q$ increases $x_i$ if $(s_q)_i > 0$ at a rate
$(s_q)_i$, and increases $f$ at a rate $\lambda_q$. Thus for small changes $\Delta x_i$, if a constraint
$x_i \ge x_i' + \Delta x_i$ is added to the problem, then the increase $\Delta f$ in the optimal solution
value $f'$ is given by $\Delta f = p_i^+ \Delta x_i$ where

$$p_i^+ = \min_{q:\,(s_q)_i > 0} \frac{\lambda_q}{(s_q)_i}. \tag{13.1.6}$$

For larger changes, previously inactive constraints may become active along the
corresponding edge, and so further increases in $f$ occur on restoring feasibility.
Thus in general a lower bound $\Delta f \ge p_i^+ \Delta x_i$ on the increase in $f$ is obtained.
Thus if the fractional part of $x_i'$ is $\phi_i$ ($= x_i' - [x_i']$) then the required estimate of
the change in $f$ is

$$e_i^+ = p_i^+ (1 - \phi_i). \tag{13.1.7}$$

It may be that no element $(s_q)_i > 0$ exists, corresponding to the fact that no
feasible direction exists which increases $x_i$. Thus the resulting branched problem is
infeasible (an infeasible node). It can be convenient to define $e_i^+ = p_i^+ = \infty$ in this case.
A similar development can be used when decreasing $x_i$. Consider a small change
$\Delta x_i$ ($> 0$), and add the constraint $x_i \le x_i' - \Delta x_i$ to the problem. Then $\Delta f = p_i^- \Delta x_i$
where

$$p_i^- = \min_{q:\,(s_q)_i < 0} \frac{\lambda_q}{-(s_q)_i}. \tag{13.1.8}$$

In general $\Delta f \ge p_i^- \Delta x_i$, and the required estimate is

$$e_i^- = p_i^- \phi_i. \tag{13.1.9}$$


The estimates (13.1.7) and (13.1.9) are used as described above in choosing the
branching variable and estimating $E$ for each unsolved problem created by
branching. In fact since for integer LP these estimates give strict lower bounds on
$\Delta f$, it follows that use of $L = E$ gives a tighter lower bound than the value of the
parent problem $L = f'$. Other ways in which the structure of an LP problem can be used
advantageously are given by Beale (1968). Often the dual formulation is recommended
for solving the LP problem since constraints can be added whilst retaining
feasibility. However it seems to me that the standard form given in Question 8.8
would also be very suitable, using a cost function like (8.4.5) to restore feasibility.
When solving linear constraint problems in which the objective is non-linear and
when the solution $x'$ to any subproblem lies at a vertex, then similar estimates to
the above can be made. These estimates are strict bounds if the function is convex.
If the solution is not at a vertex then second order effects have to be introduced
into (13.1.6) or (13.1.8) along the directions of zero slope which exist to avoid the
trivial case $p_i^+ = p_i^- = 0$. I do not know whether this possibility has been investigated
in practice although the information to do it is available. Similar estimates
can also be made in integer problems with non-linear constraints.

An example in which these estimates are calculated is given by problem (13.1.5)
whose solution is described in Figure 13.1.3. Consider possible branches at the
solution of the root problem (1). Feasible edges are defined by

$$A^{-1} = \frac{1}{2996}\begin{bmatrix} 28 & 82 \\ -14 & 66 \end{bmatrix} = \begin{bmatrix} s_1^T \\ s_2^T \end{bmatrix}$$

and

$$\lambda = A^{-1}\begin{pmatrix} 1 \\ 10 \end{pmatrix} = \frac{1}{2996}\begin{pmatrix} 848 \\ 646 \end{pmatrix}.$$

Consider branching on $x_2$: searching along either $s_1$ or $s_2$ increases $x_2$ so it can be
predicted that the branch $x_2 \le 67$ is infeasible and $p_2^- = e_2^- = \infty$. For increasing
$x_2$,

$$p_2^+ = \min\left(\frac{848}{82}, \frac{646}{66}\right) = \frac{646}{66}$$

and $e_2^+ = 0.09 \times 646/66 \approx 0.9$ which agrees with the observed increase in value in
going from problem 1 to 3, since no other constraints become active. For branching
on $x_1$,

$$p_1^+ = 848/28 \approx 30, \qquad p_1^- = 646/14 \approx 46,$$

$$e_1^+ \approx 0.74 \times 30 \approx 22, \qquad e_1^- \approx 0.26 \times 46 \approx 12,$$

which again agrees with observed changes in value. These estimates also illustrate
the different rules for choosing branching variables. The ‘worst case’ rule in the
stack method is achieved by $e_2^- = \infty$ and gives rise to a branch on $x_2$. The ‘most
important variable’ rule is solved by $e_1^- = 12$ and causes a branch on $x_1$. In fact
this is a much better branch since it yields problems 4 and 5 directly, and this
supports the current preference for this type of rule.
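The estimates (13.1.6) to (13.1.9) and the two selection rules are easy to reproduce numerically for the root problem of (13.1.5). The sketch below hard-codes the edges, multipliers and root solution quoted in this example; the function and variable names are illustrative assumptions rather than anything defined in the text.

```python
import math

# Feasible edges s_q (rows of A^{-1}) and multipliers lambda_q at the root.
S = [(28 / 2996, 82 / 2996), (-14 / 2996, 66 / 2996)]
LAM = [848 / 2996, 646 / 2996]
X_ROOT = (21756 / 2996, 203456 / 2996)     # about (7.26, 67.91)

def estimates(i):
    """Return (e_i^-, e_i^+) from (13.1.6)-(13.1.9) for variable index i."""
    phi = X_ROOT[i] - math.floor(X_ROOT[i])            # fractional part
    up = [lam / s[i] for s, lam in zip(S, LAM) if s[i] > 0]
    down = [lam / -s[i] for s, lam in zip(S, LAM) if s[i] < 0]
    p_plus = min(up) if up else math.inf               # no edge increases x_i
    p_minus = min(down) if down else math.inf          # no edge decreases x_i
    return p_minus * phi, p_plus * (1.0 - phi)

E = [estimates(i) for i in range(2)]

# 'Worst case' rule: maximize the larger estimate.
worst_case = max(range(2), key=lambda i: max(E[i]))
# 'Most important variable' rule: maximize the smaller estimate.
most_important = max(range(2), key=lambda i: min(E[i]))

print(E)
print("worst case branches on x%d" % (worst_case + 1))
print("most important variable rule branches on x%d" % (most_important + 1))
```

Indices are zero-based in the code, so index 0 refers to $x_1$ and index 1 to $x_2$.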


13.2 Geometric Programming 


Another important type of problem which it is possible to transform to the solution
of a more simple problem is known as the geometric programming problem. This is
a nonlinear programming problem in which the problem functions are constructed
from terms of the form

$$t_i = c_i \prod_{j=1}^{n} x_j^{a_{ij}} \tag{13.2.1}$$

where the $a_{ij}$ are real and may be negative or fractional for instance. The
coefficients $c_i$ and $t_i$ however must be positive, and a sum of such terms is referred to
as a posynomial. The problem

$$\begin{array}{ll} \text{minimize} & g_0(x) \\ \text{subject to} & g_k(x) \le 1, \quad k = 1, 2, \ldots, m \\ & x > 0 \end{array} \tag{13.2.2}$$

in which the functions $g_k(x)$ are posynomials is a convenient standard form for a
geometric programming problem. The terms which occur within $g_0, g_1, \ldots, g_m$
are numbered consecutively from 1 to $p$, say, and sets $J_k$ which collect the indices
of terms in each $g_k$ are defined by

$$J_k = \{r_k, r_k + 1, \ldots, s_k\}$$

so that

$$r_0 = 1, \quad r_{k+1} = s_k + 1, \quad \text{and} \quad s_m = p. \tag{13.2.3}$$

Thus the posynomials $g_k(x)$ are defined by

$$g_k = \sum_{i \in J_k} t_i, \quad k = 0, 1, \ldots, m. \tag{13.2.4}$$

Geometric programming has met with a number of applications in engineering 
(see, for instance, Duffin, Peterson, and Zener (1967), Dembo and Avriel (1978), 
and Bradley and Clyne (1976), amongst many). There could well be scope for even 
more applications (either directly or by making suitable posynomial approxi- 
mations) if the technique were more widely understood and appreciated. 

The geometric programming problem (13.2.2) has non-linear constraints but a 
transformation exists which causes considerable simplification. In some cases the 
solution can be found merely by solving a certain linear system, and in general it 
is possible to solve a more simple but equivalent minimization problem in which 
only linear constraints arise. In this problem the relative contribution of the various 
terms $t_i$ is determined, and a subsequent calculation is then made to find the
optimum variables $x^*$. This transformation was originally developed as an
application of the arithmetic–geometric mean inequality (see Duffin, Peterson, and
Zener, 1967), hence the name geometric programming. However a more
straightforward approach is to develop the transformation as an example of the duality
theory of Section 9.5. First of all the transformation of variables

$$x_j = \exp(y_j), \quad j = 1, 2, \ldots, n \tag{13.2.5}$$

is made, which removes the constraints $x_j > 0$ and allows (13.2.1) to be written

$$t_i = c_i \exp\Bigl(\sum_j a_{ij} y_j\Bigr). \tag{13.2.6}$$

In addition (13.2.2) is written equivalently as

$$\begin{array}{ll} \text{minimize} & \log g_0(y) \\ \text{subject to} & \log g_k(y) \le 0, \quad k = 1, 2, \ldots, m. \end{array} \tag{13.2.7}$$

The main result then follows by an application of the duality transformation to
(13.2.7).
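The change of variable (13.2.5) can be checked numerically: a posynomial evaluated in $x$ and the same function evaluated through (13.2.6) in $y = \log x$ must agree. The sketch below is only illustrative; the coefficient vector, exponent matrix and test point are arbitrary values chosen here, and the function names are hypothetical.

```python
import math

def posynomial(c, A, x):
    """g(x) = sum_i c_i * prod_j x_j ** a_ij, the form built from (13.2.1)."""
    return sum(ci * math.prod(xj ** aij for xj, aij in zip(x, row))
               for ci, row in zip(c, A))

def log_posynomial(c, A, y):
    """log g written in the variables y_j = log x_j, using (13.2.6)."""
    return math.log(sum(ci * math.exp(sum(aij * yj for aij, yj in zip(row, y)))
                        for ci, row in zip(c, A)))

# Arbitrary illustrative data: g(x) = 2 x1^-1 x2^-1/2 + x1 x2.
c = [2.0, 1.0]
A = [[-1.0, -0.5], [1.0, 1.0]]
x = (1.5, 0.25)
y = tuple(math.log(xj) for xj in x)

g_x = posynomial(c, A, x)
log_g_y = log_posynomial(c, A, y)
print(g_x, math.exp(log_g_y))
```

Working in $y$ needs no positivity constraint, since any real $y_j$ maps to a positive $x_j$.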

So that this transformation is applicable it is first of all important to establish
that each function $\log g_k(y)$ is a convex function on $\mathbb{R}^n$. To do this let
$y_0, y_1 \in \mathbb{R}^n$ and $\theta \in (0, 1)$. If $u_i = [t_i(y_0)]^{1-\theta}$ and $v_i = [t_i(y_1)]^{\theta}$
are defined, then it follows from (13.2.6) that

$$u_i v_i = t_i(y_\theta)$$

where $y_\theta = (1 - \theta) y_0 + \theta y_1$. If $u$ collects the elements $u_i$, $i \in J_k$, $v$ similarly, then
$u^T v = g_k(y_\theta)$. If $p = 1/(1 - \theta)$, $q = 1/\theta$ are defined, then Hölder's inequality

$$u^T v \le \|u\|_p \|v\|_q \tag{13.2.8}$$

is valid (since $u_i > 0$, $v_i > 0$, and $p^{-1} + q^{-1} = 1$), where $\|u\|_p \triangleq (\sum_i u_i^p)^{1/p}$. Taking
logarithms of both sides in (13.2.8) establishes that

$$\log g_k(y_\theta) \le (1 - \theta) \log g_k(y_0) + \theta \log g_k(y_1)$$

and so $\log g_k(y)$ satisfies the definition (9.4.3) of a convex function.
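It follows from this result that (13.2.7) is a convex programming problem (see
(9.4.8)). Thus the dual problem (9.5.1) becomes

$$\begin{array}{ll} \underset{y,\,\lambda}{\text{maximize}} & \mathcal{L}(y, \lambda) \triangleq \log g_0(y) + \sum_{k=1}^{m} \lambda_k \log g_k(y) \\ \text{subject to} & \nabla_y \mathcal{L}(y, \lambda) = 0, \quad \lambda \ge 0. \end{array} \tag{13.2.9}$$

The Lagrangian can be written more simply by defining $\lambda_0 = 1$ so that

$$\mathcal{L}(y, \lambda) = \sum_{k=0}^{m} \lambda_k \log g_k(y). \tag{13.2.10}$$

As it stands, (13.2.9) has non-linear constraints so it is important to show that it
can be simplified by a further change of variable. Write the equality constraints as

$$\frac{\partial \mathcal{L}}{\partial y_j} = \sum_{k=0}^{m} \frac{\lambda_k}{g_k} \sum_{i \in J_k} a_{ij} t_i = 0, \quad j = 1, 2, \ldots, n. \tag{13.2.11}$$

These equations suggest defining new non-negative variables $\delta_i$ by

$$\delta_i = \frac{\lambda_k t_i}{g_k}, \quad \forall\, i \in J_k, \quad k = 0, 1, \ldots, m, \tag{13.2.12}$$

which correspond to the terms (13.2.1), suitably weighted. Then (13.2.11) becomes

$$\sum_{i=1}^{p} \delta_i a_{ij} = 0, \quad j = 1, 2, \ldots, n, \tag{13.2.13}$$

and these equations are known as orthogonality constraints. Thus the equality
constraint in (13.2.9) is equivalent to a set of (under-determined) linear equations in
terms of the $\delta_i$ variables. An equivalent form of (13.2.4) is also implied by
(13.2.12), that is

$$\sum_{i \in J_k} \delta_i = \lambda_k, \quad k = 0, 1, \ldots, m, \tag{13.2.14}$$

and is known as a normalization constraint. Furthermore it is possible to simplify
the dual objective function (13.2.10), because (assuming $0 \log 0 = 0$)

$$\begin{aligned}
\lambda_k \log g_k &= \lambda_k \log \lambda_k + \lambda_k \log(g_k/\lambda_k) \\
\text{by (13.2.14)} \quad &= \lambda_k \log \lambda_k + \sum_{i \in J_k} \delta_i \log(g_k/\lambda_k) \\
\text{by (13.2.12)} \quad &= \lambda_k \log \lambda_k + \sum_{i \in J_k} \delta_i \log(t_i/\delta_i) \\
\text{by (13.2.6)} \quad &= \lambda_k \log \lambda_k + \sum_{i \in J_k} \delta_i \log(c_i/\delta_i) + \sum_{i \in J_k} \delta_i \sum_{j=1}^{n} a_{ij} y_j \\
\text{by (13.2.13)} \quad &= \lambda_k \log \lambda_k + \sum_{i \in J_k} \delta_i \log(c_i/\delta_i),
\end{aligned} \tag{13.2.15}$$

the last step being valid once the terms are summed over $k = 0, 1, \ldots, m$, since
(13.2.13) involves all $p$ terms.
Thus the weighted terms $\delta_i$ and the Lagrange multipliers $\lambda_k$ satisfy the following
dual geometric programming problem

$$\begin{array}{ll}
\underset{\delta,\,\lambda}{\text{maximize}} & v(\delta, \lambda) \triangleq \sum_{i=1}^{p} \delta_i \log(c_i/\delta_i) + \sum_{k=1}^{m} \lambda_k \log \lambda_k \\
\text{subject to} & \sum_{i=1}^{p} \delta_i a_{ij} = 0, \quad j = 1, 2, \ldots, n \\
& \sum_{i \in J_k} \delta_i = \lambda_k, \quad k = 0, 1, \ldots, m \\
& \delta \ge 0, \quad \lambda \ge 0
\end{array} \tag{13.2.16}$$

where $\lambda_0 = 1$.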

This then is the resulting transformed problem which is more easily solved than 
(13.2.2). The problem has all linear constraints and a non-linear objective function 
so can be solved by the methods of Chapter 11. The number of degrees of freedom 
left when the constraints have been eliminated is known as the degree of difficulty 
of the problem, and is $p - n - 1$ if all the primal constraints are active. In some
cases the degree of difficulty is known to be zero, in which case (13.2.16) reduces
just to the solution of the orthogonality and normalization equations. Usually
however it is necessary to solve (13.2.16) by a suitable numerical method. In fact
(13.2.16) has other advantageous features. The dual objective function $v(\delta, \lambda)$ is
separable into one variable terms so that the Hessian matrix is diagonal and is readily
calculated. If the $\lambda_k$ variables are eliminated using (13.2.14) then the resulting
function (of $\delta$ only) is concave (see Duffin, Peterson, and Zener, 1967), so it
follows from theorem 9.4.1 that any solution of (13.2.16) is a global solution.

A more thorough study of how best to solve (13.2.16) by using a linearly con- 
strained Newton method of the type described in Chapter 11 is given by Dembo 
(1979). Amongst the factors which he takes into account is the need to eliminate
in a stable fashion as described in Chapter 10 here. He also considers how best to
take sparsity of the exponent matrix $[a_{ij}]$ into account, and extends (13.2.2) to
include lower and upper bound constraints

$$0 < l_j \le x_j \le u_j.$$

(These can be written as posynomial constraints $u_j^{-1} x_j \le 1$ and $l_j x_j^{-1} \le 1$ but it is
more efficient to treat them separately.) A particularly interesting feature is that
when a posynomial constraint $g_k \le 1$ is inactive at the solution then a whole block
of variables $\delta_i$, $i \in J_k$, and $\lambda_k$, are zero. Thus Dembo considers modifications to the
active set strategy which allow the addition or removal of such blocks of indices to 
or from the active set. 

Once the dual solution $\delta^*$, $\lambda^*$ has been found, then the primal function value is
given by $\log g_0^* = v(\delta^*, \lambda^*)$ (see theorem 9.5.1). It is also important to be able to
recover the optimum variables $y^*$ and hence $x^*$ in the primal problem. In principle
it is possible to do this by solving the non-linear equations (13.2.4) and (13.2.6)
for the active primal constraints for which $\lambda_k^* > 0$ and hence $g_k^* = 1$. However a
more simple possibility is to determine the optimum multipliers $w_j^*$, $j = 1, 2, \ldots, n$,
of the orthogonality constraints in (13.2.16). These multipliers are available directly
if (13.2.16) is solved by an active set method such as in Section 11.2. Then

$$w_j^* = y_j^* = \log x_j^* \tag{13.2.17}$$

determines the primal solution. To justify (13.2.17), the first order conditions
(9.1.3) applied to (13.2.16) give

$$\begin{aligned} \log(\delta_i^*/c_i) + 1 &= \mu_k^* + \sum_j a_{ij} w_j^*, \quad i \in J_k, \\ \log \lambda_k^* + 1 &= \mu_k^* \end{aligned} \tag{13.2.18}$$

where $w_j^*$ and $\mu_k^*$ are the multipliers of the orthogonality and normalization
constraints respectively and $k$ indexes the active primal constraints. Eliminating $\mu_k^*$
gives

$$\log \frac{\delta_i^*}{c_i \lambda_k^*} = \sum_j a_{ij} w_j^*$$

and since $g_k^* = 1$, it follows from (13.2.12) and (13.2.6) that

$$\sum_j a_{ij} y_j^* = \sum_j a_{ij} w_j^*.$$

Since the exponent matrix $[a_{ij}]$ has more rows than columns and may be assumed
to have full rank (see below), the result (13.2.17) follows. There is no loss of
generality in assuming that $[a_{ij}]$ has full rank. If not, then from (13.2.6) it is
possible to eliminate one or more $y_j$ variable in favour of the remaining $y_j$ variables,
so as to give a matrix $[a_{ij}]$ of full rank.

Finally a simple example of the geometric programming transformation is given.
Consider the geometric programming problem in two variables with one constraint

$$\begin{array}{ll} \text{minimize} & g_0(x) \triangleq \dfrac{2}{x_1 x_2^{1/2}} + x_1 x_2 \\ \text{subject to} & g_1(x) \triangleq \tfrac{1}{2} x_1 + x_2 \le 1, \quad x > 0. \end{array} \tag{13.2.19}$$

There are four terms, so the orthogonality constraints are

$$\begin{aligned} -\delta_1 + \delta_2 + \delta_3 &= 0 \\ -\tfrac{1}{2}\delta_1 + \delta_2 + \delta_4 &= 0 \end{aligned}$$

and the normalization constraints are

$$\begin{aligned} \delta_1 + \delta_2 &= 1 \\ \delta_3 + \delta_4 &= \lambda_1. \end{aligned}$$

Assuming that the primal constraint is active, the $\delta_i$ variables can be eliminated in
terms of $\lambda_1$ ($= \lambda$) giving

$$\delta(\lambda) = (2\lambda + 4,\ -2\lambda + 3,\ 4\lambda + 1,\ 3\lambda - 1)^T / 7. \tag{13.2.20}$$

This illustrates that there is one degree of difficulty ($= 4 - 2 - 1$) and therefore a
maximization problem in one variable ($\lambda$ here) must be solved. Maximizing the dual
objective function $v(\delta(\lambda), \lambda)$ by a line search in $\lambda$ gives $\lambda^* = 1.0247$ and
$v^* = 1.10764512$. Also from (13.2.20) it follows that

$$\delta^* = (0.86420, 0.13580, 0.72841, 0.29630)^T.$$

Thus the dual geometric programming problem has been solved, and immediately
$g_0^* = \exp v^* = 3.02722126$. Substituting in (13.2.18) for $k = 1$ and $J_1 = \{3, 4\}$ gives
$\mu_1^* = 1.02441$ and hence

$$\begin{pmatrix} 0.3518 \\ -1.24078 \end{pmatrix} = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} \begin{pmatrix} w_1^* \\ w_2^* \end{pmatrix}.$$

Hence by (13.2.17) the vector of variables which solves the primal is $x^* = (1.4217,
0.28915)^T$. In fact the primal problem is fairly simple and it is also possible to
obtain a solution to this directly by eliminating one of the variables and carrying out
a line search. This calculation confirms the solution given by the geometric
programming transformation.
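The one-variable dual maximization in this example is easy to reproduce. The sketch below is a minimal illustration, assuming the reading of (13.2.19) with coefficients $c = (2, 1, \tfrac{1}{2}, 1)$ and the elimination (13.2.20); the golden section search is simply one convenient line search and is not claimed to be the method used for the figures quoted above. All function names are hypothetical.

```python
import math

C = (2.0, 1.0, 0.5, 1.0)       # coefficients c_i of the four terms

def delta(lam):
    """delta(lambda) from (13.2.20)."""
    return ((2 * lam + 4) / 7, (3 - 2 * lam) / 7,
            (4 * lam + 1) / 7, (3 * lam - 1) / 7)

def dual(lam):
    """Dual objective v(delta(lambda), lambda) of (13.2.16), with lambda_0 = 1."""
    return (sum(d * math.log(c / d) for c, d in zip(C, delta(lam)))
            + lam * math.log(lam))

def golden_max(f, a, b, tol=1e-10):
    """Golden section search for the maximizer of a unimodal f on [a, b]."""
    r = (math.sqrt(5.0) - 1.0) / 2.0
    c, d = b - r * (b - a), a + r * (b - a)
    while b - a > tol:
        if f(c) > f(d):
            b, d = d, c
            c = b - r * (b - a)
        else:
            a, c = c, d
            d = a + r * (b - a)
    return 0.5 * (a + b)

# delta > 0 requires 1/3 < lambda < 3/2; stay strictly inside that interval.
lam_star = golden_max(dual, 1 / 3 + 1e-9, 1.5 - 1e-9)
d_star = delta(lam_star)
v_star = dual(lam_star)

# Recover w = log x from the active constraint (k = 1, J_1 = {3, 4}):
# log(delta_i / (c_i * lambda)) = sum_j a_ij w_j, with rows (1, 0) and (0, 1).
w = (math.log(d_star[2] / (C[2] * lam_star)),
     math.log(d_star[3] / (C[3] * lam_star)))
x_star = tuple(math.exp(wj) for wj in w)

print(lam_star, v_star, math.exp(v_star))
print(d_star)
print(x_star)
```

As a consistency check, the recovered point should make the primal constraint active and give a primal objective value equal to $\exp v^*$.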

Questions for Chapter 13

1. Solve graphically the LP

$$\begin{array}{ll} \text{minimize} & -x_1 - 2x_2 \\ \text{subject to} & -2x_1 + 2x_2 \le 3 \\ & 2x_1 + 2x_2 \le 9, \end{array}$$

for the cases (i) no integer restriction, (ii) $x_1$ integer, (iii) $x_1$, $x_2$ integer. Show
how case (iii) would be solved by the branch and bound method, determining
problem lower bounds by $L = f' + e_i^+$ or $L = f' + e_i^-$. Give the part of the
solution tree which is explored. Show also that although the method solves the
problem finitely, the tree contains two infinite paths. Show that if the nearest
integer is used as a criterion for which problem to explore first after branching,
then it is possible that one of these infinite paths may be followed, depending
on how ties are broken, so that the algorithm does not converge. Notice however
that if the best pending node is selected after every branch then the algorithm
does converge. If bounds $x \ge 0$ are added to the problem, enumerate the
whole tree: there are ten parent nodes, six infeasible nodes and five integer
feasible nodes in all.

2. Solve the integer programming problem

$$\begin{array}{ll} \text{minimize} & x_1^4 + x_2^4 + 16\,(x_1 x_2 + (4 + x_2)^2) \\ \text{subject to} & x_1, x_2 \text{ integer} \end{array}$$

by the branch and bound method. In doing this, problems in which $x_i \le n_i$ or
$x_i \ge n_i + 1$ arise. Treat these as equations, and solve the resulting problem by
a line search in the remaining variable (or by using a quasi-Newton subroutine
with the $i$th row and column of $H^{(1)}$ zeroed). Calculate Lagrange multipliers
at the resulting solution and verify that the inequality problem is solved. Use
the distance to the nearest integer as the criterion for which branch to select.
Give the part of the solution tree which is explored.

3. Consider the geometric programming problem

$$\begin{array}{ll} \text{minimize} & x_1^{-1} x_2^{-1} x_3^{-1} \\ \text{subject to} & x_1 + 2x_2 + 2x_3 \le 72, \quad x > 0 \end{array}$$

derived from Rosenbrock’s parcel problem (see Question 9.3). Show that the
problem has zero degree of difficulty and hence solve the equations (13.2.13)
and (13.2.14) to solve (13.2.16). Show immediately that the solution value is
1/3456. Find the multipliers $w_j^*$, $j = 1, 2, 3$, of the orthogonality constraints
and hence determine the optimum variables from (13.2.17).

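4. Consider the geometric programming problem

$$\begin{array}{ll} \text{minimize} & 40 x_1 x_2 + 20 x_2 x_3 \\ \text{subject to} & \tfrac{1}{5} x_1^{-1} x_2^{-1/2} + \tfrac{3}{5} x_2^{-1} x_3^{-2/3} \le 1, \quad x > 0. \end{array}$$

Show that the problem has zero degree of difficulty and can be solved in a
similar way to the previous problem.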




5. Consider the geometric programming problem

$$\begin{array}{ll} \text{minimize} & 40 x_1^{-1} x_2^{-1/2} x_3^{-1} + 20 x_1 x_3 + 40 x_1 x_2 x_3 \\ \text{subject to} & \tfrac{1}{3} x_1^{-2} x_2^{-2} + \tfrac{4}{3} x_2^{1/2} x_3^{-1} \le 1, \quad x > 0. \end{array}$$

Show that the problem has one degree of difficulty. Hence express $\delta$ in terms
of the parameter $\lambda$ ($= \lambda_1$) as in (13.2.20), and solve the problem numerically by
maximizing the function $v(\delta(\lambda), \lambda)$ by a line search in $\lambda$. Then solve for the
variables using (13.2.17). This and the previous problem are given by Duffin,
Peterson, and Zener (1967) where many more examples of geometric programming
problems can be found.


Chapter 14 


Non-differentiable Optimization 


14.1 Introduction 


In Volume 1 on unconstrained optimization an early assumption is to exclude all
problems for which the objective function $f(x)$ is not smooth. Yet practical problems
sometimes arise (non-differentiable optimization (NDO) or non-smooth
optimization problems) which do not meet this requirement; this chapter studies
progress which has been made in the practical solution of such problems. Examples
of NDO problems occur when solving non-linear equations $c_i(x) = 0$, $i = 1, 2,
\ldots, m$ (see Section 6.1, Volume 1) by minimizing $\|c(x)\|_1$ or $\|c(x)\|_\infty$. This
arises either when solving simultaneous equations exactly ($m = n$) or when finding
best solutions to over-determined systems ($m > n$) as in data fitting applications.
Another similar problem is that of finding a feasible point of a system of non-linear
inequalities $c_i(x) \le 0$, $i = 1, 2, \ldots, m$, by minimizing $\|c(x)^+\|_1$ or $\|c(x)^+\|_\infty$
where $c_i^+ = \max(c_i, 0)$. A generalization of these problems arises when the
equations or inequalities are the constraints in a nonlinear programming problem
(Chapter 12). Then a possible approach is to use an exact penalty function and
minimize functions like $\nu f(x) + \|c(x)\|$ or $\nu f(x) + \|c(x)^+\|$, in particular using the
$L_1$ norm. This idea is attracting much research interest and is expanded upon in
Section 14.3. Yet another type of problem is to minimize the max function
$\max_i c_i(x)$ where the max is taken over some finite set. This includes many examples
from electrical engineering including microwave network design and digital filter
design (Charalambous, 1979). In fact almost all these examples can be considered in
a more generalized sort of way as special cases of a certain composite function and a
major portion of this chapter is devoted to studying this type of function.
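The lack of smoothness in these functions is easy to exhibit numerically. The sketch below uses a small made-up system of two residual functions (an assumption for illustration only, not an example from the text) and compares one-sided difference quotients of $\|c(x)\|_1$ at a point where one residual vanishes; the helper names are hypothetical.

```python
def c(x):
    """Hypothetical residuals: c1 = x1^2 + x2^2 - 1, c2 = x1 - x2."""
    return [x[0] ** 2 + x[1] ** 2 - 1.0, x[0] - x[1]]

def norm1(v):
    return sum(abs(t) for t in v)

def norm_inf(v):
    return max(abs(t) for t in v)

def plus(v):
    """Componentwise c_i^+ = max(c_i, 0), used for inequality systems."""
    return [max(t, 0.0) for t in v]

def phi(x):
    """The L1 objective ||c(x)||_1."""
    return norm1(c(x))

# At x = (1, 1) the second residual is zero, so |c2| has a kink there.
x = (1.0, 1.0)
h = 1e-6
forward = (phi((x[0] + h, x[1])) - phi(x)) / h     # slope moving right in x1
backward = (phi(x) - phi((x[0] - h, x[1]))) / h    # slope arriving from the left

print(forward, backward)
print(norm_inf(c(x)), norm1(plus(c(x))))
```

The two one-sided slopes differ, so no gradient exists at this point and the methods of Volume 1 cannot be applied directly.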

Another common source of NDO problems arises when using the decomposition 
principle. For example the LP problem 

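$$\begin{array}{ll} \underset{x,\,y}{\text{minimize}} & c^T x + d^T y \\ \text{subject to} & Ax + By \le b \end{array}$$

can be written as the convex NDO problem

$$\underset{x}{\text{minimize}} \quad f(x) \triangleq c^T x + \min_y\, [\, d^T y : By \le b - Ax \,].$$
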
Application of NDO methods to column generation problems and to a variety of
scheduling problems is described by Marsten (1975) and the application to network 
scheduling problems is described by Fisher, Northup, and Shapiro (1975). Both 
these papers appear in the Mathematical Programming Study 3 (Balinski and Wolfe, 
1975) which is a valuable reference. Another good source is the book by Lemarechal 
and Mifflin (1978) and their test problems 3 and 4 illustrate further applications. 

In so far as I understand them, all these applications could in principle be formu- 
lated as max functions and hence as composite functions (see below). However in 
practice this would be too complicated or require too much storage. Thus a different
situation can arise in which the only information available at any point $x$
is $f(x)$ and a vector $g$. Usually $g$ is $\nabla f$, but if $f$ is non-differentiable at $x$ then $g$ is an
element of the subdifferential $\partial f$ (see Section 14.2) for convex problems, or the
generalized gradient in non-convex problems. Since less information about $f(x)$
is available, this type of application is more difficult than the composite function 
application and is referred to here as basic NDO. However both types of application 
have some structure in common. 

In view of this common structure, algorithms for all types of NDO problem are 
discussed together in Section 14.4. For max functions it is shown that an algorithm 
with second order convergence can be obtained by making linearizations of the 
individual functions over which the max is taken, together with an additional 
(smooth) quadratic approximation which includes curvature information from the 
functions in the max. Variants of this algorithm have been used in a number of 
practical applications. The algorithm is closely related to one in which the compo- 
site function is approximated in a similar way. This algorithm can be made globally 
convergent by incorporating the idea of a trust region and details of this result are 
given in Section 14.5. The algorithm is applicable to both max function and $L_1$
and $L_\infty$ approximation problems and to nonlinear programming applications of
non-differentiable exact penalty functions. Algorithms for basic NDO have not 
progressed as far because of the difficulties caused by the limited availability of 
information. One type of method — the bundle method — tries to accumulate 
information locally about the subdifferential of the objective function. Alterna- 
tively linearization methods have also been tried and the possibility of introducing 
second order information into these algorithms is being considered. In fact there is 
currently much research interest in NDO algorithms of all kinds and further develop- 
ments can be expected. 

A prerequisite for describing NDO problems and algorithms is a study of opti- 
mality conditions for non-differentiable functions. This can be done at various 
levels of generality. I have tried to make the approach here as simple and readable 
as possible (although this is not easy in view of the inherent difficulty of the 
material), whilst trying to cover all practical applications. The chapter is largely 
self-contained and does not rely on any key results from outside sources. One 
requirement is the extension of the material in Section 9.4 on convex functions 
to include the non-differentiable case. The concept of a subdifferential is intro- 
duced and the resulting definitions of directional derivatives lead to a simple state- 
ment of first order conditions. More general applications can be represented by the 
composite function

$$\phi(x) \triangleq f(x) + h(c(x)) \tag{14.1.1}$$

which I shall refer to as composite NDO. Here $f(x)$ ($\mathbb{R}^n \to \mathbb{R}^1$) and $c(x)$ ($\mathbb{R}^n \to \mathbb{R}^m$)
are smooth functions ($\mathbb{C}^1$), and $h(c)$ ($\mathbb{R}^m \to \mathbb{R}^1$) is convex but non-smooth ($\mathbb{C}^0$).
In some applications the smooth function $f(x)$ may be absent. Important special
cases of $h(c)$ are set out in (14.1.3) below. Optimality conditions for (14.1.1) can
be obtained as a straightforward extension of those for convex functions. This
material is set out in Section 14.2 and both first and second order conditions are
studied. There are NDO problems (for example $\phi(x) = \max(0, \min(x_1, x_2))$) which
do not fit into this category and which are not even well modelled by (14.1.1) 
locally, although I do not know of any which arise in practice. Wider classes of 
function have been introduced which cover these cases, for example the ‘locally 
Lipschitz’ functions of Clarke (1975), but substantial complication in the analysis 
is introduced. Also these classes do not directly suggest algorithms as the function 
(14.1.1) does. Furthermore the resulting first order conditions permit descent 
directions to occur and so are of little practical use (Womersley, 1980). Hence these 
classes are not studied any further here. 

In most cases (but not all: an exception is the exact penalty function $\phi = \nu f + \|c\|_2$)
a further specialization can be made in which $h(c)$ is restricted to be a polyhedral
convex function. In this case the graph of $h(c)$ is made up of a finite number of
supporting hyperplanes $c^T h_i + b_i$, and $h(c)$ is thus defined by

$$h(c) \triangleq \max_i\, (c^T h_i + b_i) \tag{14.1.2}$$

where the vectors $h_i$ and scalars $b_i$ are given. Thus the polyhedral convex function
is a max function in terms of linear combinations of the elements of $c$. A simple
illustration is given in Figure 14.1.1. Most interest lies in five special cases of
(14.1.2), in all of which $b_i = 0$ for all $i$. These functions are set out below, together
with the vectors $h_i$ which define them given as columns of a matrix $H$.





Figure 14.1.1 Polyhedral convex function 
Case I      $h(c) = \max_i c_i$          $H = I$  ($m \times m$)
Case II     $h(c) = \|c^+\|_\infty$      $H = [I : 0]$  ($m \times (m+1)$)
Case III    $h(c) = \|c\|_\infty$        $H = [I : -I]$  ($m \times 2m$)
Case IV     $h(c) = \|c^+\|_1$           columns of $H$ are all possible combinations of 1 and 0  ($m \times 2^m$)
Case V      $h(c) = \|c\|_1$             columns of $H$ are all possible combinations of 1 and $-1$  ($m \times 2^m$)
                                                                        (14.1.3)

For example with $m = 2$ in case (V) the matrix $H$ is

$$H = \begin{bmatrix} 1 & 1 & -1 & -1 \\ 1 & -1 & 1 & -1 \end{bmatrix}.$$

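The five cases can be checked numerically. The following Python sketch is an illustration only and is not part of the original development; the helper names `columns`, `h_polyhedral` and `direct` are hypothetical. It builds the columns of $H$ for each case and compares $\max_i c^T h_i$ with the norm or max expression it is claimed to represent.

```python
from itertools import product


def h_polyhedral(c, H_cols):
    """Evaluate h(c) = max_i c^T h_i over the given columns h_i (all b_i = 0)."""
    return max(sum(cj * hj for cj, hj in zip(c, h)) for h in H_cols)


def columns(case, m):
    """Columns of H for cases I to V of (14.1.3), returned as tuples."""
    unit = [tuple(1 if j == i else 0 for j in range(m)) for i in range(m)]
    if case == "I":
        return unit
    if case == "II":
        return unit + [tuple([0] * m)]
    if case == "III":
        return unit + [tuple(-v for v in e) for e in unit]
    if case == "IV":
        return list(product((1, 0), repeat=m))
    if case == "V":
        return list(product((1, -1), repeat=m))
    raise ValueError(case)


def direct(case, c):
    """The same five functions written directly as max or norm expressions."""
    plus = [max(ci, 0.0) for ci in c]
    return {
        "I": max(c),
        "II": max(plus),
        "III": max(abs(ci) for ci in c),
        "IV": sum(plus),
        "V": sum(abs(ci) for ci in c),
    }[case]
```

For instance `columns("V", 2)` lists the four sign patterns that form the columns of the matrix $H$ displayed above.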
Contours of these functions for m = 2 are illustrated in Figure 14.1.2. The broken 
lines indicate the surfaces along which the different linear pieces join up and along 
which the derivative is discontinuous. The term piece is used to denote that part of 
the graph of a max function in which one particular function achieves the maximum 
value, for example the graph illustrated in case (II) of Figure 14.1.2 is made up of
three linear pieces and the active equations are $h(c) = c_2$ for $c_1 \le c_2$ and $c_2 \ge 0$,
$h(c) = c_1$ for $c_2 \le c_1$ and $c_1 \ge 0$, and $h(c) = 0$ for $c_1 \le 0$ and $c_2 \le 0$. The
subdifferential $\partial h(c)$ for a polyhedral convex function has the simple form expressed in
lemma 14.2.2 as the convex hull of the gradients of the active pieces. The polyhedral
nature of $h(c)$ has important consequences in regard to second order conditions,
and this is considered later in Section 14.2.

Figure 14.1.2 Contours of polyhedral convex functions (panels: Case I, $h(c) = \max_i c_i$; Case II, $h(c) = \|c^+\|_\infty$; Case III, $h(c) = \|c\|_\infty$; Case IV, $h(c) = \|c^+\|_1$; Case V, $h(c) = \|c\|_1$; axes $c_1$, $c_2$)

In fact it is possible to derive optimality results for the composite function $\phi(x)$
in (14.1.1) when $h(c)$ is the polyhedral function (14.1.2) without using notions of
convexity at all. Clearly $\phi(x)$ can be written

$$\phi(x) = \max_i\, (f(x) + c(x)^T h_i + b_i) \tag{14.1.4}$$

which is equivalent to

$$\phi(x) = \min v : \quad v \ge f(x) + c(x)^T h_i + b_i \quad \forall\, i. \tag{14.1.5}$$

Therefore $x^*$ minimizes $\phi(x)$ locally iff $x^*$, $v^*$ is a local solution of the nonlinear
programming problem

$$\begin{array}{ll} \underset{x,\,v}{\text{minimize}} & v \\ \text{subject to} & v - f(x) - c(x)^T h_i \ge b_i \quad \forall\, i. \end{array} \tag{14.1.6}$$

Thus first and second order optimality conditions for (14.1.4) can be obtained by 
applying the equivalent results in nonlinear programming to (14.1.6), that is 
theorems 9.1.1, 9.3.1, and 9.3.2. It turns out however that this approach is less 
general and somewhat clumsy, so it is only sketched out briefly here, and the 
derivation given in Section 14.2 is preferred. The first difficulty concerns the
regularity assumption $\mathcal{F}' = F'$. It is possible (but not trivial; see Question 14.1)
to prove that this always holds, and therefore it is important not to make an
unnecessary independence assumption. The appropriate Lagrangian function for
theorem 9.1.1 is

$$\mathcal{L}(x, v, \mu) = v - \sum_i \mu_i\, (v - f(x) - c(x)^T h_i - b_i) \tag{14.1.7}$$

and it is implied at $x^*$, $v^*$ that multipliers $\mu^*$ exist such that

$$\begin{aligned}
&\partial\mathcal{L}/\partial v = 1 - \sum_i \mu_i^* = 0 \quad\text{or}\quad \sum_i \mu_i^* = 1 \\
&\nabla_x \mathcal{L} = 0 \quad\text{or}\quad \sum_i \mu_i^* (g^* + A^* h_i) = 0 \\
&\mu^* \ge 0 \\
&\mu_i^* > 0 \ \Rightarrow\ v^* = f^* + c^{*T} h_i + b_i.
\end{aligned} \tag{14.1.8}$$



As in previous chapters the notation $g = \nabla f$ and $A = \nabla c^T$ is used and $f^* = f(x^*)$,
etc. If $\lambda = H\mu$ is written, then the existence of a vector $\mu^*$ in the above is
equivalent to the existence of a vector $\lambda^*$ in the set

$$\partial h^* = \operatorname*{conv}_{i \in \mathcal{A}^*} h_i \tag{14.1.9}$$

such that $g^* + A^*\lambda^* = 0$. In fact $\partial h^*$ is the subdifferential of $h(c^*)$ as shown in
lemma 14.2.2. $\mathcal{A}^*$ is the set of active constraints or equivalently the set of indices
at which the max in (14.1.4) is attained, that is

$$\mathcal{A}^* = \{ i : c^{*T} h_i + b_i = h(c^*) \}. \tag{14.1.10}$$


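It is possible to use (14.1.9) to obtain equivalent but more convenient
expressions for $\partial h(c)$ in cases (I) to (V) in (14.1.3) above as follows:

$$\partial \max_i c_i = \{\lambda : \textstyle\sum_i \lambda_i = 1,\ \lambda \ge 0,\ c_i < \max_i c_i \Rightarrow \lambda_i = 0\} \tag{14.1.11}$$

$$\partial \|c^+\|_\infty = \{\lambda : \textstyle\sum_i \lambda_i \le 1,\ \lambda \ge 0,\ c_i < \|c^+\|_\infty \Rightarrow \lambda_i = 0\} \tag{14.1.12}$$

$$\begin{aligned}
\partial \|c\|_\infty = \{\lambda :\ & c \ne 0 \Rightarrow \textstyle\sum_i |\lambda_i| = 1, \\
& |c_i| < \|c\|_\infty \Rightarrow \lambda_i = 0, \\
& |c_i| = \|c\|_\infty \Rightarrow \lambda_i c_i \ge 0, \\
& c = 0 \Rightarrow \textstyle\sum_i |\lambda_i| \le 1\}
\end{aligned} \tag{14.1.13}$$

$$\partial \|c^+\|_1 = \{\lambda : 0 \le \lambda_i \le 1,\ c_i > 0 \Rightarrow \lambda_i = 1,\ c_i < 0 \Rightarrow \lambda_i = 0\} \tag{14.1.14}$$

$$\partial \|c\|_1 = \{\lambda : |\lambda_i| \le 1,\ c_i \ne 0 \Rightarrow \lambda_i = \operatorname{sign} c_i\}. \tag{14.1.15}$$

The equivalence between these sets and those defined by $\operatorname{conv}_{i \in \mathcal{A}} h_i$ is left as an
exercise (Question 14.3). Expressions (14.1.12) to (14.1.15) can also be derived as
special cases of (14.3.7) or (14.3.8).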
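A numerical spot check of two of these sets may be helpful. The sketch below is an illustration only; the function names, tolerances and random sampling are assumptions and do not come from the text. It tests membership in (14.1.11) and (14.1.15) and then samples the subgradient inequality $h(c + \delta) \ge h(c) + \delta^T \lambda$, which every member of $\partial h(c)$ must satisfy.

```python
import random


def l1_norm(v):
    return sum(abs(t) for t in v)


def in_subdiff_l1(lam, c, tol=1e-12):
    """Membership test for the set (14.1.15)."""
    for li, ci in zip(lam, c):
        if abs(li) > 1 + tol:
            return False
        if ci != 0 and abs(li - (1 if ci > 0 else -1)) > tol:
            return False
    return True


def in_subdiff_max(lam, c, tol=1e-12):
    """Membership test for the set (14.1.11)."""
    top = max(c)
    if abs(sum(lam) - 1) > tol or any(li < -tol for li in lam):
        return False
    return all(abs(li) <= tol for li, ci in zip(lam, c) if ci < top)


def subgradient_inequality_holds(h, c, lam, trials=2000, seed=0):
    """Sample h(c + d) >= h(c) + d^T lam over random perturbations d."""
    rng = random.Random(seed)
    for _ in range(trials):
        d = [rng.uniform(-2, 2) for _ in c]
        lhs = h([ci + di for ci, di in zip(c, d)])
        rhs = h(c) + sum(di * li for di, li in zip(d, lam))
        if lhs < rhs - 1e-12:
            return False
    return True
```

Sampling can only refute the inequality, never prove it, so the check is a sanity test rather than a derivation of the sets.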

It is also possible to apply the development of (9.3.9) onwards to problem 
(14.1.6) to obtain the second order conditions of theorems 14.2.2 and 14.2.3, and 
also the regularity condition (14.2.25) (see lemma 14.2.7). To do this an index 
$p \in \mathcal{A}^*$ is used to eliminate $v$ (or $s_{n+1}$; see Question 14.1). The analysis is not
entirely straightforward so again the more direct approach in Section 14.2 is 
preferred. The equivalence of the two different approaches is however shown by 
lemma 14.2.8. 

The parameters $\lambda^* \in \partial h^*$ which exist at the solution to an NDO problem are
closely related to Lagrange multipliers and indeed they can be given a simple
interpretation similar to (9.1.10). Consider a perturbed problem in which $c(x)$ is
replaced by $c(x) + \varepsilon$ giving a function $\phi_\varepsilon(x)$ and assume that the minimizer $x(\varepsilon)$ is
such that the same constraints are active in (14.1.6). It follows that the right hand
side of each active constraint is perturbed by an amount $\varepsilon^T h_i$. Since $\mu_i^*$ is the
multiplier of the $i$th constraint and using (9.1.10) it follows that the change to $v$
is $\Delta v = \sum_i \varepsilon^T h_i \mu_i^* = \varepsilon^T \lambda^*$ and hence that

$$\frac{d\phi^*}{d\varepsilon_i} = \lambda_i^*. \tag{14.1.16}$$

Thus $\lambda_i^*$ measures the first order rate of change of the optimum function value
consequent on perturbations to $c_i$. This result can also be used to illustrate the need
for the condition $\lambda \ge 0$ in (14.1.11). For example if the first order conditions arising
from (14.1.8) ($\exists\, \lambda^* \in \partial h^*$ such that $g^* + A^*\lambda^* = 0$) are satisfied except that
$\lambda_i^* < 0$ for some $i$, then it is possible to show that $x^*$ is not optimal. Consider a
perturbation $\varepsilon$ with $\varepsilon_i > 0$ and $\varepsilon_j = 0$, $j \ne i$. Existence of the solution $x(\varepsilon)$ follows
under a suitable independence assumption using the implicit function theorem.
At $x(\varepsilon)$ the max in (14.1.2) is achieved by all $j \in \mathcal{A}^*$ if $\varepsilon$ is small enough, so for all
$j \in \mathcal{A}^*$, $j \ne i$,

$$c_j(x(\varepsilon)) = c_i(x(\varepsilon)) + \varepsilon_i.$$

Thus $c_i(x(\varepsilon)) < c_j(x(\varepsilon))$ and hence $\phi_\varepsilon(x(\varepsilon)) = \phi_0(x(\varepsilon))$ where $\phi_0$ refers to the
unperturbed function. It follows from (14.1.16) that $d\phi_0/d\varepsilon_i = \lambda_i^* < 0$, so the
unperturbed function decreases along $x(\varepsilon)$ and $x^*$ is not a local solution. The
situation is illustrated in Figure 14.1.3. Clearly increasing $c_1$ by $\varepsilon$ does not reduce
$\phi(x)$ in case (i) when $\lambda \ge 0$ but does reduce $\phi(x)$ in case (ii). Another example of
this is given in Question 14.12.

Figure 14.1.3 Interpretation of multipliers in NDO (panels: case (i), $x^*$ is optimal; case (ii), $x^*$ is not optimal)
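The sensitivity result (14.1.16) can be observed on a one-variable max function. The sketch below is an illustration under assumed data, not an example from the text: it takes $c_1(x) = x$, $c_2(x) = 1 - x$ and no smooth term, for which $A = [1 \;\; {-1}]$ and the multipliers satisfying $A\lambda = 0$, $\sum_i \lambda_i = 1$, $\lambda \ge 0$ are $\lambda^* = (\tfrac12, \tfrac12)$. The optimum value of the perturbed problem is differenced numerically; the function names and the ternary search are hypothetical helpers.

```python
def minimize_1d(phi, lo, hi, iters=200):
    """Ternary search for the minimizer of a convex function on [lo, hi]."""
    for _ in range(iters):
        a = lo + (hi - lo) / 3
        b = hi - (hi - lo) / 3
        if phi(a) < phi(b):
            hi = b
        else:
            lo = a
    return 0.5 * (lo + hi)


def phi_star(eps1, eps2):
    """Optimum value of max(c1(x) + eps1, c2(x) + eps2), c1 = x, c2 = 1 - x."""
    def phi(x):
        return max(x + eps1, 1.0 - x + eps2)
    return phi(minimize_1d(phi, -10.0, 10.0))


t = 1e-3
# Central differences of the optimum value with respect to eps1 and eps2.
d_eps1 = (phi_star(t, 0.0) - phi_star(-t, 0.0)) / (2 * t)
d_eps2 = (phi_star(0.0, t) - phi_star(0.0, -t)) / (2 * t)
```

Both differences come out close to $\tfrac12$, in agreement with $d\phi^*/d\varepsilon_i = \lambda_i^*$; here both pieces stay active under the perturbation, as the derivation assumes.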

The requirement that the active constraints in (14.1.6) remain the same is 
important and is in the nature of an independence assumption. This usually holds 
for case (I) and (II) problems in (14.1.3) and for case (III) problems when $c^* \ne 0$.
However for the $L_1$ norm functions of cases (IV) and (V) the assumption is not
likely to hold. An alternative interpretation of Lagrange multipliers suitable for the
$L_1$ case is derived at the end of Section 14.3.

Finally it is observed that it is possible to attack the problem of constrained 
NDO, although no details are given here. First order conditions for smooth
constraint functions are given by Watson (1978) and for constrained $L_\infty$
approximation by Andreassen and Watson (1976). First and second order conditions for
a single non-differentiable constraint function involving a norm are given by 
Fletcher and Watson (1980). There is no difficulty in extending these ideas to a 
set of constraints defined in terms of composite functions like (14.1.1). 


14.2 Optimality Conditions 


Figure 14.2.1 Supporting hyperplanes to a non-differentiable convex function

In this section a number of results are proved leading to optimality conditions for
the composite functions described in the previous section. Since these functions
are defined in terms of non-differentiable convex functions it is important to study
the latter case first. Such a function is illustrated in Figure 14.2.1 and it can be seen
that the non-differentiability at $x^*$ allows the possibility of a number of supporting
hyperplanes, in contrast to Figure 9.2.2. (Supporting hyperplanes to the graph of a
convex function, which is a convex set, always exist; see Hadley, 1961.) Given a
convex function $f(x)$ defined on a convex set $K \subset \mathbb{R}^n$, and given $x \in \text{interior}(K)$,
then for each supporting hyperplane at $x$ it follows that an inequality of the form

$$f(x + \delta) \ge f(x) + \delta^T g \quad \forall\, x + \delta \in K \tag{14.2.1}$$

holds, and $g$ is a normal vector of the supporting hyperplane. Such a vector $g$ is
referred to as a subgradient at $x$ and (14.2.1) is known as the subgradient inequality
and is the generalization of (9.4.4) to non-smooth functions. The set of all
subgradients at $x$ is known as the subdifferential at $x$ and is defined by

$$\partial f(x) \triangleq \{ g : f(x + \delta) \ge f(x) + \delta^T g \quad \forall\, x + \delta \in K \}. \tag{14.2.2}$$

The notation $\partial f^{(k)} = \partial f(x^{(k)})$ analogous to $f^{(k)} = f(x^{(k)})$, etc., is used. It is easy to
see that $\partial f(x)$ is a closed convex set, which by lemma 14.2.1 below is also bounded
and therefore compact. In fact a more general result can be established.


Lemma 14.2.1
$\partial f(x)$ is bounded for all $x \in B \subset \text{interior}(K)$ where $B$ is compact.


Proof


If the lemma is false, there exists a sequence $g^{(k)} \in \partial f(x^{(k)})$, $x^{(k)} \in B$, such that $\|g^{(k)}\|_2 \to \infty$.
By compactness there exists a subsequence $x^{(k)} \to x'$. Define $\delta^{(k)} = g^{(k)} / \|g^{(k)}\|_2^2$. Then
$x^{(k)} + \delta^{(k)} \in K$ for $k$ sufficiently large, so by (14.2.1)

$$f(x^{(k)} + \delta^{(k)}) \ge f^{(k)} + g^{(k)T} \delta^{(k)} = f^{(k)} + 1.$$

But in the limit, $f^{(k)} \to f'$, $\delta^{(k)} \to 0$, and so $f(x^{(k)} + \delta^{(k)}) \to f'$, which is a
contradiction, so the lemma is true. □


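If $f$ is differentiable at $x$, then

$$f(x + \delta) = f(x) + \delta^T \nabla f(x) + o(\|\delta\|)$$

and subtracting this from (14.2.1) gives

$$\delta^T (g - \nabla f(x)) \le o(\|\delta\|).$$

Choosing $\delta = \theta (g - \nabla f(x))$, $\theta \downarrow 0$, shows that $g = \nabla f$. Hence in this case $\partial f(x)$ is the
single vector $\nabla f(x)$.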
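In one variable the subdifferential of a convex function is the interval between its left and right slopes, which gives a simple way to see both situations. The sketch below is an illustration only; the function $f(x) = |x| + \tfrac12 x^2$, the step size and the helper names are assumptions, not taken from the text. It estimates the one-sided slopes and tests candidate subgradients in (14.2.1).

```python
def f(x):
    """A convex function with a kink at x = 0 (illustrative choice)."""
    return abs(x) + 0.5 * x * x


def one_sided_slopes(func, x, step=1e-7):
    """Finite-difference estimates of the left and right derivatives at x."""
    left = (func(x) - func(x - step)) / step
    right = (func(x + step) - func(x)) / step
    return left, right


def satisfies_subgradient_inequality(func, x, g, deltas):
    """Check f(x + d) >= f(x) + d * g on a grid of displacements d."""
    return all(func(x + d) >= func(x) + d * g - 1e-12 for d in deltas)


deltas = [k / 100.0 for k in range(-300, 301)]
kink_left, kink_right = one_sided_slopes(f, 0.0)      # close to -1 and +1
smooth_left, smooth_right = one_sided_slopes(f, 1.0)  # both close to 2
```

At the kink any $g$ between the two slopes passes the inequality test, whereas at $x = 1$ only the gradient value 2 does, which is the single-vector case described above.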

In Section 14.1 the particular class of polyhedral convex functions $h(c)$,
$\mathbb{R}^m \to \mathbb{R}^1$, is defined by

$$h(c) = \max_i\, (c^T h_i + b_i) \tag{14.2.3}$$

where $h_i$ are the columns of a given finite matrix $H$. Defining

$$\mathcal{A} = \mathcal{A}(c) \triangleq \{ i : c^T h_i + b_i = h(c) \} \tag{14.2.4}$$

as the set of supporting planes which are active at $c$ and so achieve the maximum,
then it is clear that these planes determine the subdifferential $\partial h(c)$. This is proved
as follows.

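Lemma 14.2.2

$$\partial h(c) = \operatorname*{conv}_{i \in \mathcal{A}(c)} h_i \tag{14.2.5}$$


Proof


$\partial h(c)$ is defined by

$$\partial h(c) = \{ \lambda : h(c + \delta) \ge h(c) + \delta^T \lambda \quad \forall\, \delta \}. \tag{14.2.6}$$

Let $\lambda = H\mu \in$ (14.2.5) where $\mu_i \ge 0$, $\sum_i \mu_i = 1$ (and $\mu_i = 0$ for $i \notin \mathcal{A}$). Then for all $\delta$,

$$\begin{aligned}
h(c) + \delta^T \lambda &= \max_i\, (c^T h_i + b_i) + \sum_{i \in \mathcal{A}} \delta^T h_i \mu_i \\
&\le \max_{i \in \mathcal{A}}\, (c^T h_i + b_i) + \max_{i \in \mathcal{A}} \delta^T h_i \\
&= \max_{i \in \mathcal{A}}\, ((c + \delta)^T h_i + b_i) \le h(c + \delta).
\end{aligned}$$

Thus $\lambda \in$ (14.2.6). Now let $\lambda \in$ (14.2.6) and assume $\lambda \notin$ (14.2.5). Then by lemma
14.2.3 below, there exists $s \ne 0$ such that $s^T \lambda > s^T \mu$ $\forall\, \mu \in$ (14.2.5). Taking $\delta = \alpha s$
with $\alpha > 0$, and since $h_i \in$ (14.2.5) $\forall\, i \in \mathcal{A}$,

$$h(c) + \delta^T \lambda = \max_i\, (c^T h_i + b_i) + \alpha s^T \lambda > c^T h_i + b_i + \alpha s^T h_i \quad \forall\, i \in \mathcal{A},$$

so that

$$h(c) + \delta^T \lambda > \max_{i \in \mathcal{A}}\, ((c + \alpha s)^T h_i + b_i) = \max_i\, ((c + \alpha s)^T h_i + b_i) = h(c + \delta)$$

for $\alpha$ sufficiently small, since the max is then achieved on a subset of $\mathcal{A}$. Thus
(14.2.6) is contradicted, proving $\lambda \in$ (14.2.5). Hence the definitions of $\partial h(c)$ in
(14.2.5) and (14.2.6) are equivalent. □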


Examples of this result for a number of particular cases of $h(c)$ and $H$ are given
in more detail in Section 14.1. For convex functions involving a norm, an alternative
description of $\partial h(c)$ is provided by (14.3.7) and (14.3.8). The above lemma
makes use of the following important result.


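Lemma 14.2.3 (Separating hyperplane lemma for convex sets)


If $K$ is a closed convex set and $\lambda \notin K$ then there exists a hyperplane which separates
$\lambda$ and $K$ (see Figure 14.2.2).


Proof


Let $x_0 \in K$. Then the set $\{x : \|x - \lambda\|_2 \le \|x_0 - \lambda\|_2\}$ is bounded and so there
exists a minimizer, $\bar{x}$ say, to the problem: $\min \|x - \lambda\|_2$, $x \in K$. Then for any
$x \in K$, $\theta \in [0, 1]$,

$$\|(1 - \theta)\bar{x} + \theta x - \lambda\|_2^2 \ge \|\bar{x} - \lambda\|_2^2$$

and in the limit $\theta \downarrow 0$ it follows that

$$(x - \bar{x})^T (\lambda - \bar{x}) \le 0 \quad \forall\, x \in K.$$

Thus the vector $s = \lambda - \bar{x} \ne 0$ satisfies both $s^T(\lambda - \bar{x}) > 0$ and $s^T(x - \bar{x}) \le 0$
$\forall\, x \in K$ and hence

$$s^T \lambda > s^T x \quad \forall\, x \in K. \tag{14.2.7}$$

The hyperplane $s^T(x - \bar{x}) = 0$ thus separates $K$ and $\lambda$ as illustrated in Figure
14.2.2. □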
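The construction in this proof is easy to carry out when the projection onto $K$ is available in closed form. The sketch below is an illustration only; the choice of $K$ as a box and the helper names are assumptions. It projects a point $\lambda$ onto the unit square, forms $s = \lambda - \bar{x}$, and checks (14.2.7) at the corners of the box, where a linear function attains its maximum over the box.

```python
from itertools import product


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def project_onto_box(point, lower, upper):
    """Euclidean projection onto the box {x : lower <= x <= upper}."""
    return [min(max(p, lo), hi) for p, lo, hi in zip(point, lower, upper)]


def separating_normal(point, lower, upper):
    """Return s = point - xbar and the projection xbar, as in the proof."""
    xbar = project_onto_box(point, lower, upper)
    return [p - xb for p, xb in zip(point, xbar)], xbar


lam = [2.0, 0.5]                       # a point outside the unit square
lower, upper = [0.0, 0.0], [1.0, 1.0]
s, xbar = separating_normal(lam, lower, upper)
corners = list(product(*zip(lower, upper)))
# Margin in (14.2.7): s^T lam minus the largest value of s^T x over K.
gap = dot(s, lam) - max(dot(s, x) for x in corners)
```

The margin equals $\|s\|_2^2$ here because the maximum of $s^T x$ over $K$ is attained at $\bar{x}$.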


Figure 14.2.2 Existence of a separating hyperplane

At any point $x'$ at which $\nabla f$ does not exist, it nonetheless happens that the
directional derivative of the convex function in any direction is well defined. It is
again assumed that $f(x)$ is defined on a convex set $K \subset \mathbb{R}^n$ and $x' \in \text{interior}(K)$.
The following preliminary result is required.


Lemma 14.2.4
Let $x^{(k)} \to x'$ and $g^{(k)} \in \partial f^{(k)}$. Then any accumulation point of $\{g^{(k)}\}$ is in $\partial f'$.


Proof

For any $y \in K$, (14.2.1) can be written as

$$f(y) \ge f^{(k)} + (y - x^{(k)})^T g^{(k)}.$$

Taking any subsequence for which $g^{(k)} \to g'$ it follows that

$$f(y) \ge f' + (y - x')^T g' \quad \forall\, y \in K,$$

that is $g' \in \partial f'$. □


A result for the directional derivative at $x'$ in a direction $s$ can now be given in a
quite general way for any directional sequence (see Section 9.2).


Lemma 14.2.5


If $x^{(k)} \to x'$ is any directional sequence with $\delta^{(k)} \downarrow 0$ and $s^{(k)} \to s$ in (9.2.1)
(or more generally $\delta^{(k)} > 0$ and $\delta^{(k)} \to 0$), then

$$\lim_{k \to \infty} \frac{f^{(k)} - f'}{\delta^{(k)}} = \max_{g \in \partial f'} s^T g. \tag{14.2.8}$$


Proof

If $g^{(k)} \in \partial f^{(k)}$ then it follows from (14.2.1) that for $k$ sufficiently large

$$f' \ge f^{(k)} - \delta^{(k)} s^{(k)T} g^{(k)}$$

and

$$f^{(k)} \ge f' + \delta^{(k)} s^{(k)T} g \quad \forall\, g \in \partial f'$$

both hold, and hence

$$s^{(k)T} g^{(k)} \ge \frac{f^{(k)} - f'}{\delta^{(k)}} \ge \max_{g \in \partial f'} s^{(k)T} g. \tag{14.2.9}$$

Since $\partial f^{(k)}$ is bounded in a neighbourhood of $x'$ (lemma 14.2.1), there exists a
subsequence for which $g^{(k)} \to g'$, and $g' \in \partial f'$ by lemma 14.2.4. If (14.2.8) is not
true then (14.2.9) gives a contradiction in the limit of such a subsequence. □


Thus (14.2.8) shows that the directional derivative is determined by an extreme
supporting hyperplane whose subgradient gives the greatest slope: see Figure 14.2.1
for example.

It is possible to deduce from this result that if $x^*$ is a local minimizer of $f(x)$,
then $f^{(k)} \ge f^*$ for all $k$ sufficiently large, and hence from (14.2.8) that

$$\max_{g \in \partial f^*} s^T g \ge 0 \quad \forall\, s : \|s\| = 1. \tag{14.2.10}$$

Thus a first order necessary condition for a local minimum is that the directional
derivative is non-negative in all directions. This can be stated alternatively as

$$0 \in \partial f^* \tag{14.2.11}$$

which generalizes the condition $g^* = 0$ for smooth functions. Clearly (14.2.11)
implies (14.2.10). If $0 \notin \partial f'$ then by lemma 14.2.3 (with $\lambda = 0$, $K = \partial f'$) there
exists a vector $s = -\bar{g}/\|\bar{g}\|_2$ for which $s^T g < 0$ $\forall\, g \in \partial f'$, where $\bar{g}$ is the vector
which minimizes $\|g\|_2$ $\forall\, g \in \partial f'$. Applying this result at $x^*$ shows that (14.2.10)
and (14.2.11) are equivalent. It is now immediate from (14.2.1) that either
(14.2.10) or (14.2.11) is also a sufficient condition for a global minimizer at $x^*$.
In fact the vector $s = -\bar{g}/\|\bar{g}\|_2$ defined above is the steepest descent vector at $x'$.
Assuming $0 \notin \partial f'$ then from (14.2.8) the direction of least slope is defined by

$$\begin{aligned}
\min_{\|s\|_2 = 1}\ \max_{g \in \partial f'} s^T g &= \max_{g \in \partial f'}\ \min_{\|s\|_2 = 1} s^T g \\
&= \max_{g \in \partial f'} -\|g\|_2 = -\|\bar{g}\|_2
\end{aligned} \tag{14.2.12}$$

and hence the least slope is achieved when $s = -\bar{g}/\|\bar{g}\|_2$. The justification of
interchanging the min and max operations is explored in Question 14.7.
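The steepest descent vector can be computed explicitly when the subdifferential is the convex hull of two gradients, as at a kink of the max of two smooth functions. The sketch below is an illustration only; the gradients $g_1 = (2, 1)$, $g_2 = (-1, 1)$ and the helper names are assumptions. It finds the minimum-norm element $\bar{g}$ of the segment and compares the slope along $-\bar{g}/\|\bar{g}\|_2$ with the slopes along a fine grid of unit directions.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def min_norm_on_segment(g1, g2):
    """Minimum-norm point of conv{g1, g2}: minimize ||g2 + t (g1 - g2)||_2, 0 <= t <= 1."""
    d = [a - b for a, b in zip(g1, g2)]
    dd = dot(d, d)
    t = 0.0 if dd == 0 else min(1.0, max(0.0, -dot(g2, d) / dd))
    return [b + t * di for b, di in zip(g2, d)]


def slope(s, gradients):
    """Directional derivative (14.2.8) when the subdifferential is conv(gradients)."""
    return max(dot(s, g) for g in gradients)


g1, g2 = [2.0, 1.0], [-1.0, 1.0]
gbar = min_norm_on_segment(g1, g2)
norm = math.sqrt(dot(gbar, gbar))
s_star = [-v / norm for v in gbar]
least = slope(s_star, [g1, g2])
# Smallest slope found over 3600 equally spaced unit directions.
grid_min = min(
    slope([math.cos(2 * math.pi * k / 3600), math.sin(2 * math.pi * k / 3600)], [g1, g2])
    for k in range(3600)
)
```

For these data $\bar{g} = (0, 1)$, the least slope is $-\|\bar{g}\|_2 = -1$, and no grid direction does better, consistent with (14.2.12).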

The main aim in introducing the above development is to apply it to composite
functions of the form

$$\phi(x) = f(x) + h(c(x)) \tag{14.2.13}$$

where $f(x)$ ($\mathbb{R}^n \to \mathbb{R}^1$) and $c(x)$ ($\mathbb{R}^n \to \mathbb{R}^m$) are smooth ($\mathbb{C}^1$) functions and $h(c)$
($\mathbb{R}^m \to \mathbb{R}^1$) is convex but non-smooth ($\mathbb{C}^0$). Let $x^{(k)} \to x'$ be a directional sequence
with $\delta^{(k)} \downarrow 0$ and $s^{(k)} \to s$ in (9.2.1). By Taylor series

$$f^{(k)} = f' + \delta^{(k)} g'^T s^{(k)} + o(\delta^{(k)})$$

so $(f^{(k)} - f')/\delta^{(k)} \to g'^T s$. Likewise

$$c^{(k)} = c' + \delta^{(k)} A'^T s^{(k)} + o(\delta^{(k)})$$

so $c^{(k)} \to c'$ is a directional sequence in $\mathbb{R}^m$ with $(c^{(k)} - c')/\delta^{(k)} \to A'^T s$. Thus by
applying lemma 14.2.5 to $h(c)$, it follows that

$$\lim_{k \to \infty} \frac{\phi^{(k)} - \phi'}{\delta^{(k)}} = \max_{\lambda \in \partial h'} s^T (g' + A'\lambda) \tag{14.2.14}$$

and this gives the directional derivative at $x'$ in the direction $s$ for the function $\phi(x)$
in (14.2.13). Another more general result about directional derivatives is proved in
lemma 14.5.1 and its corollary.
It can be deduced from (14.2.14) that if $x^*$ is a local minimizer of $\phi(x)$, then
$\phi^{(k)} \ge \phi^*$ for all $k$ sufficiently large, and hence

$$\max_{\lambda \in \partial h^*} s^T (g^* + A^*\lambda) \ge 0 \quad \forall\, s : \|s\| = 1. \tag{14.2.15}$$

This is a first order necessary condition for a local minimizer which like (14.2.10)
can be interpreted as a non-negative directional derivative in all directions. Again
the result can be stated alternatively as

$$0 \in \partial\phi(x^*) \triangleq \{ \gamma : \gamma = g + A\lambda \quad \forall\, \lambda \in \partial h \}_{x = x^*}. \tag{14.2.16}$$

The set $\partial\phi^*$ thus defined, although convex and compact, is not the subdifferential
because $\phi$ may not be a convex function, but it is convenient to use the same
notation. (It is in fact the generalized gradient of Clarke (1975).) Its definition in
(14.2.16) is in the nature of a generalized chain rule, by analogy with the
expression $\nabla\phi = g + A\nabla h$ for smooth functions. The equivalence of (14.2.15) and
(14.2.16) is again a consequence of the separating hyperplane lemma 14.2.3. Yet
another way to state this condition is to introduce the Lagrangian function

$$\mathcal{L}(x, \lambda) = f(x) + \lambda^T c(x). \tag{14.2.17}$$

Then an equivalent statement of (14.2.16) is as follows.


Theorem 14.2.1 (First order necessary conditions)
If $x^*$ minimizes $\phi(x)$ in (14.2.13) then there exists a vector $\lambda^* \in \partial h^*$ such that

$$\nabla\mathcal{L}(x^*, \lambda^*) = g^* + A^*\lambda^* = 0. \tag{14.2.18}$$


Proof


Immediate since $\partial\phi^*$ is the set of vectors $\nabla\mathcal{L}(x^*, \lambda)$ for all $\lambda \in \partial h^*$. □


This form illustrates the close relationship between the vector $\lambda^* \in \partial h^*$ and the
Lagrange multipliers in theorem 9.1.1. These conditions are illustrated by Questions 
14.8, 14.9, 14.10, and 14.12. In general, since ¢(x) may not be convex, the con- 
ditions of theorem 14.2.1 are not sufficient. 
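A small worked instance may make theorem 14.2.1 concrete. The sketch below is an illustration under assumed data, not an example from the text: it takes $f(x) = \tfrac12 x_1^2 + \tfrac12 (x_2 - \tfrac12)^2$, $c(x) = (x_1 - \tfrac12,\ x_2)$ and $h(c) = \|c\|_1$, so that $A = \nabla c^T$ is the unit matrix and (14.2.18) reduces to $\lambda^* = -g^*$. Membership of $\lambda^*$ in $\partial h^*$ is tested with (14.1.15), and a random sample of nearby points confirms that none has a smaller value of $\phi$.

```python
import random


def f(x):
    return 0.5 * x[0] ** 2 + 0.5 * (x[1] - 0.5) ** 2


def grad_f(x):
    return [x[0], x[1] - 0.5]


def c(x):
    # A = grad c^T is the identity for this c, so g + A*lambda = 0 gives lambda = -g.
    return [x[0] - 0.5, x[1]]


def phi(x):
    return f(x) + sum(abs(ci) for ci in c(x))


def in_subdiff_l1(lam, cvals, tol=1e-12):
    """Membership test for the subdifferential of the L1 norm, (14.1.15)."""
    for li, ci in zip(lam, cvals):
        if abs(li) > 1 + tol:
            return False
        if ci != 0 and abs(li - (1 if ci > 0 else -1)) > tol:
            return False
    return True


x_star = [0.5, 0.0]                          # candidate point with c(x_star) = 0
lam_star = [-gi for gi in grad_f(x_star)]    # solves g* + A* lambda = 0
first_order_ok = in_subdiff_l1(lam_star, c(x_star))

rng = random.Random(1)
no_descent = all(
    phi([x_star[0] + rng.uniform(-0.1, 0.1), x_star[1] + rng.uniform(-0.1, 0.1)])
    >= phi(x_star)
    for _ in range(1000)
)
```

Here $\phi$ happens to be convex, so the first order conditions are also sufficient; in general they are not, as the text goes on to explain.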

Figure 14.2.3 Contours for the scaled $L_1$ Freudenstein and Roth problem, Question 14.9

In view of this last observation it is important to consider second order conditions
for $x^*$ to be a minimizer of the composite function (14.2.13). These
conditions again exhibit a close relationship with the results in Section 9.3. The
approach is based on that taken by Fletcher and Watson (1980). In considering
second order conditions it is necessary to restrict the possible directions to those 
having zero directional derivative so that second order effects become important. 
This is illustrated by the contours in Figure 14.2.3. $x^*$ satisfies first order conditions
but there are directions of zero slope along the derivative discontinuity $c_1(x) = 0$
(broken line), so that first order conditions are not sufficient. However sufficient 
second order conditions can be derived which imply that x* is a local minimizer — 
see Question 14.9. In general let $\lambda^*$ be any vector which exists in theorem 14.2.1
and consider the set

$$X = \{ x : h(c(x)) = h(c(x^*)) + (c(x) - c(x^*))^T \lambda^* \}. \tag{14.2.19}$$

Define $\mathcal{G}^*$ as the set of normalized feasible directions with respect to $X$ at $x^*$.
(That is $s \in \mathcal{G}^*$ implies that there exists a directional sequence $x^{(k)} \to x^*$ feasible
in (14.2.19), such that $s^{(k)} \to s$ in (9.2.1) with $\sigma = 1$.) It is possible to show that
these directions are closely related to the set $G^*$ of normalized directions of zero
slope, that is

$$G^* \triangleq \{ s : \max_{\lambda \in \partial h^*} s^T (g^* + A^*\lambda) = 0,\ \|s\|_2 = 1 \}. \tag{14.2.20}$$

The extent to which $\mathcal{G}^*$ and $G^*$ correspond is important and it is first shown that
$\mathcal{G}^*$ is a subset of $G^*$.

Lemma 14.2.6
$$\mathcal{G}^* \subseteq G^*$$


Proof


Let $s \in \mathcal{G}^*$ so a directional sequence in $X$ exists with $s^{(k)} \to s$, $\|s\|_2 = 1$. Using
(14.2.14), (14.2.13), (14.2.19), and a Taylor series it follows that

$$\begin{aligned}
\max_{\lambda \in \partial h^*} s^T (g^* + A^*\lambda) &= \lim \frac{\phi^{(k)} - \phi^*}{\delta^{(k)}} \\
&= \lim \frac{f^{(k)} - f^* + h(c^{(k)}) - h(c^*)}{\delta^{(k)}} \\
&= \lim \frac{f^{(k)} - f^* + (c^{(k)} - c^*)^T \lambda^*}{\delta^{(k)}} \\
&= s^T (g^* + A^*\lambda^*).
\end{aligned}$$

It then follows from (14.2.18) that $s \in G^*$. □


To give a general result going the other way is not always possible although it
can be done in important special cases. Further discussion of this is given later in
this section: at present the regularity assumption

$$\mathcal{G}^* = G^* \tag{14.2.21}$$

is made (which depends on $\lambda^*$ if more than one such vector exists). It is now
possible to state the second order conditions. In doing this it is assumed that $f$ and $c$ are
$\mathbb{C}^2$ functions. Note that as usual the regularity assumption is needed only in the
necessary conditions.


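Theorem 14.2.2 (Second order necessary conditions)


If $x^*$ minimizes $\phi(x)$ then theorem 14.2.1 holds; for each vector $\lambda^*$ which thus
exists, if (14.2.21) holds, then

$$s^T \nabla^2 \mathcal{L}(x^*, \lambda^*) s \ge 0 \quad \forall\, s \in G^*. \tag{14.2.22}$$


Proof


For any $s \in G^*$, $s \in \mathcal{G}^*$ by (14.2.21) and hence there exists a directional sequence $x^{(k)} \to x^*$,
feasible in (14.2.19). A Taylor expansion of $\mathcal{L}(x, \lambda^*)$ about $x^*$ yields

$$\mathcal{L}(x^{(k)}, \lambda^*) = f^* + c^{*T} \lambda^* + e^{(k)T} \nabla\mathcal{L}(x^*, \lambda^*) + \tfrac12 e^{(k)T} \nabla^2\mathcal{L}(x^*, \lambda^*) e^{(k)} + o(\|e^{(k)}\|^2)$$

where $e^{(k)} = x^{(k)} - x^*$, and using (14.2.18) and (9.2.1),

$$\mathcal{L}(x^{(k)}, \lambda^*) = f^* + c^{*T} \lambda^* + \tfrac12 \delta^{(k)2} s^{(k)T} \nabla^2\mathcal{L}(x^*, \lambda^*) s^{(k)} + o(\delta^{(k)2}). \tag{14.2.23}$$

Since $x^{(k)}$ is feasible in (14.2.19) it follows from (14.2.13) that

$$\phi^{(k)} = \phi^* + \tfrac12 \delta^{(k)2} s^{(k)T} \nabla^2\mathcal{L}(x^*, \lambda^*) s^{(k)} + o(\delta^{(k)2}).$$

Since $x^*$ is a local minimizer, $\phi^{(k)} \ge \phi^*$ for all $k$ sufficiently large, so dividing by
$\tfrac12 \delta^{(k)2}$ and taking the limit yields (14.2.22). □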


For sufficient conditions, it is firstly observed that if the directional derivative is
positive in all directions, that is

$$\max_{\lambda \in \partial h^*} s^T (g^* + A^*\lambda) > 0 \quad \forall\, s : \|s\| = 1,$$

or equivalently $G^*$ is empty, then the conditions of theorem 14.2.1 imply that $x^*$
is an isolated local minimizer (see Question 14.12). This result is in fact a special
case of theorem 14.2.3 below. When $G^*$ is non-empty, second order effects come
into play and this is expressed in the following.


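Theorem 14.2.3 (Second order sufficient conditions)
If there exists $\lambda^* \in \partial h^*$ such that (14.2.18) holds, and if

$$s^T \nabla^2 \mathcal{L}(x^*, \lambda^*) s > 0 \quad \forall\, s \in G^* \tag{14.2.24}$$

then $x^*$ is an isolated local minimizer of $\phi(x)$.


Proof


Assume the contrary, that there exists a sequence and hence a directional sequence $x^{(k)} \to x^*$
such that $\phi^{(k)} \le \phi^*$. By (14.2.18),

$$0 \le \max_{\lambda \in \partial h^*} s^T (g^* + A^*\lambda) = \mu$$

say. If $\mu > 0$ then $\lim (\phi^{(k)} - \phi^*)/\delta^{(k)} = \mu$ (see (14.2.14)) which contradicts
$\phi^{(k)} \le \phi^*$. Thus $\mu = 0$ and hence $s \in G^*$. Now from (14.2.17) and (14.2.13)

$$\begin{aligned}
\mathcal{L}(x^{(k)}, \lambda^*) - \mathcal{L}(x^*, \lambda^*) &= f^{(k)} + \lambda^{*T} c^{(k)} - f^* - \lambda^{*T} c^* \\
&= \phi^{(k)} - \phi^* - \left( h(c^{(k)}) - h(c^*) - (c^{(k)} - c^*)^T \lambda^* \right) \\
&\le \phi^{(k)} - \phi^*
\end{aligned}$$

using the subgradient inequality. Hence from (14.2.23)

$$0 \ge \phi^{(k)} - \phi^* \ge \tfrac12 \delta^{(k)2} s^{(k)T} \nabla^2\mathcal{L}(x^*, \lambda^*) s^{(k)} + o(\delta^{(k)2}).$$

Dividing by $\tfrac12 \delta^{(k)2}$ and taking the limit contradicts (14.2.24) and establishes the
theorem. □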


These conditions (with G* non-empty) are illustrated by Questions 14.9 and 
14.10. 
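The role of curvature along zero-slope directions can be seen in a pair of two-variable examples. The sketch below is an illustration under assumed data, not taken from the Questions: $\phi_a(x) = |x_1| - x_2^2$ and $\phi_b(x) = |x_1| + x_2^2$, both with $c(x) = x_1$ and $h(c) = |c|$. At $x^* = 0$ both satisfy theorem 14.2.1 with $\lambda^* = 0$; the slope in direction $s$ is $|s_1|$, so $G^* = \{(0, 1), (0, -1)\}$, and $\nabla^2 \mathcal{L}$ is just the Hessian of the smooth part because $c$ is linear.

```python
import math


def phi_a(x):
    """|x1| - x2^2: first order conditions hold at 0, which is not a minimizer."""
    return abs(x[0]) - x[1] ** 2


def phi_b(x):
    """|x1| + x2^2: second order sufficient conditions hold at 0."""
    return abs(x[0]) + x[1] ** 2


def curvature_on_G(hess_L):
    """s^T W s for the two zero-slope directions s = (0, 1) and s = (0, -1)."""
    directions = ([0.0, 1.0], [0.0, -1.0])
    return [
        sum(s[i] * hess_L[i][j] * s[j] for i in range(2) for j in range(2))
        for s in directions
    ]


W_a = [[0.0, 0.0], [0.0, -2.0]]   # Hessian of the Lagrangian for phi_a at x* = 0
W_b = [[0.0, 0.0], [0.0, 2.0]]    # Hessian of the Lagrangian for phi_b at x* = 0

# Values of phi_b on a small circle around the origin.
ring = [
    phi_b([0.05 * math.cos(2 * math.pi * k / 360), 0.05 * math.sin(2 * math.pi * k / 360)])
    for k in range(360)
]
```

Negative curvature on $G^*$ for $\phi_a$ violates (14.2.22), and indeed $\phi_a(0, t) = -t^2 < 0$; positive curvature for $\phi_b$ satisfies (14.2.24), and every value on the ring exceeds $\phi_b(0) = 0$.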

The second order conditions for nonlinear programming given in Section 9.3 
are almost necessary and sufficient because the regularity assumption is mild and 
there is a gap only in the case of zero curvature. The same is not true here and it is
important to realize that there are well-behaved cases in which $\mathcal{G}^* \ne G^*$. However
there are also two important cases in which $\mathcal{G}^* = G^*$ and the conditions are
therefore almost equivalent. One case is when $h(c)$ is $\|c\|$ or $\|c^+\|$ and when $c^* = 0$ or
$c^{*+} = 0$. This is described in Section 14.3 in the context of applying an exact
penalty function to solve a nonlinear programming problem. Then an assumption
that the vectors $a_i^*$ of the active constraints are independent yields $\mathcal{G}^* = G^*$
(lemma 14.3.1). Another case is when $h(c)$ is a polyhedral convex function in
which case a different independence assumption also ensures that $\mathcal{G}^* = G^*$, as
follows.


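Lemma 14.2.7


If $h(c)$ is defined by (14.2.3), if $\mu^*$ are the multipliers which exist in (14.1.8), if $p$ is
an index with $\mu_p^* > 0$, and if the vectors

$$A^*(h_i - h_p), \quad i \in \mathcal{A}^* - p \tag{14.2.25}$$

are linearly independent, then $\mathcal{G}^* = G^*$.


Proof
Let $\lambda^* = \sum_i \mu_i^* h_i$ be the corresponding vector in $\partial h^*$ and let $\mathcal{A}_+^* \triangleq \{i : \mu_i^* > 0\}$. Let
$s \in G^*$ so that

$$\max_{\lambda \in \partial h^*} s^T (g^* + A^*\lambda) = 0.$$

Let $\mathcal{A}_s^* \triangleq \{i : s^T(g^* + A^* h_i) = 0,\ i \in \mathcal{A}^*\}$ denote the set of vectors $h_i$ (dependent
on $s$) which achieve the max. By (14.2.18) $\lambda^*$ also achieves the max so $\mu_i^* > 0$
implies $i \in \mathcal{A}_s^*$ and hence $\mathcal{A}_+^* \subseteq \mathcal{A}_s^* \subseteq \mathcal{A}^*$. Thus for any $p \in \mathcal{A}_+^*$,

$$\begin{aligned}
s^T A^*(h_p - h_i) &= 0, \quad i \in \mathcal{A}_s^* - p \\
s^T A^*(h_p - h_i) &> 0, \quad i \in \mathcal{A}^* - \mathcal{A}_s^*.
\end{aligned}$$

If (14.2.25) holds, $\mathcal{A}_s^* - p$ contains fewer than $n$ elements (else $\|s\| = 1$ is
contradicted) and so (as in the proof of lemma 9.2.2) there exists an arc $x(\theta)$ with
$x^* = x(0)$ and $\dot{x}(0) = s$ defined for $\theta \ge 0$ sufficiently small such that for $\theta > 0$

$$\begin{aligned}
h_p^T c(x(\theta)) + b_p &= h_i^T c(x(\theta)) + b_i, \quad i \in \mathcal{A}_s^* - p \\
h_p^T c(x(\theta)) + b_p &> h_i^T c(x(\theta)) + b_i, \quad i \in \mathcal{A}^* - \mathcal{A}_s^*.
\end{aligned}$$

Thus for $\theta$ sufficiently small, $h(c(x(\theta))) = h_i^T c(x(\theta)) + b_i$ and hence $h(c(x(\theta))) - h^* =
h_i^T (c(x(\theta)) - c^*)$ $\forall\, i \in \mathcal{A}_s^*$. An inner product with $\mu_i^*$ shows that $x(\theta)$ is feasible in
(14.2.19) and hence $s \in \mathcal{G}^*$. □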


Corollary


If in addition $\mathcal{A}^*$ contains $n + 1$ elements and $\mathcal{A}_+^* = \mathcal{A}^*$ then $G^*$ (and hence $\mathcal{G}^*$
by lemma 14.2.6) is empty.

Proof


$\mathcal{A}_s^* - p$ must now have $n$ elements so as above $\|s\| = 1$ is contradicted. □


Conditions (14.2.25) are realistic in cases (I) and (II) (see (14.1.3)) and in case
(III) when $c^* \ne 0$. However in cases (IV) and (V) involving $\|\cdot\|_1$, the set $\mathcal{A}^*$
contains $2^q$ elements where $q$ is the number of zero components of $c^*$, and so
independence of the vectors (14.2.25) is unlikely. In this case however the following device
can be used. In the neighbourhood of a solution $x^*$, the function

$$\phi(x) = f(x) + \|c(x)\|_1 \tag{14.2.26}$$

can equivalently be written as

$$\phi(x) = f(x) + \sum_{i \notin Z} \lambda_i^* c_i(x) + \sum_{i \in Z} |c_i(x)| \tag{14.2.27}$$

where $Z = \{i : c_i^* = 0\}$, and using (14.1.15). Essentially the smooth part of (14.2.26)
is extended to include terms $\lambda_i^* c_i(x)$, $i \notin Z$, and the $\|c(x)\|_1$ part is reduced
correspondingly. Then lemma 14.3.1 can be invoked to show that independence of the
vectors $a_i^*$, $i \in Z$, yields $\mathcal{G}^* = G^*$. A slightly more general result is given by Fletcher
and Watson (1980). A similar result holds good when $\|c^+\|_1$ is used in (14.2.26).

A final observation about second order conditions is contained in the following. 


Lemma 14.2.8


For given $x^*$ let $\mu^*$ satisfy (14.1.8) and $\lambda^* = H\mu^*$. Then the second order conditions
of theorem 9.3.2 for problem (14.1.6) are equivalent to those in theorem 14.2.3.


Proof
In both cases the first order conditions hold. Let $\bar{s}^T = (s^T, s_{n+1})$. The second order
conditions of theorem 9.3.2 for (14.1.6) involve the set

$$\{ \bar{s} : \bar{s} \ne 0,\quad s_{n+1} = s^T(g^* + A^* h_i),\ i \in \mathcal{A}_+^*,\quad s_{n+1} \ge s^T(g^* + A^* h_i),\ i \in \mathcal{A}^* - \mathcal{A}_+^* \}. \tag{14.2.28}$$

Let $\bar{s} \in$ (14.2.28). An inner product with $\mu_i^*$ implies $s_{n+1} = 0$. For any $\lambda \in \partial h^*$ let
$\lambda = H\mu$; then an inner product with $\mu_i$ gives $s^T(g^* + A^*\lambda) \le 0$ and hence $s/\|s\|_2 \in G^*$
in (14.2.20). Now let $s \in G^*$, let $p \in \mathcal{A}_+^*$ and define $s_{n+1} = s^T(g^* + A^* h_p)$. Then
as in lemma 14.2.7,

$$\begin{aligned}
s_{n+1} &= s^T(g^* + A^* h_i), \quad i \in \mathcal{A}_s^* \\
s_{n+1} &> s^T(g^* + A^* h_i), \quad i \in \mathcal{A}^* - \mathcal{A}_s^*.
\end{aligned}$$

Since $\mathcal{A}_s^* \supseteq \mathcal{A}_+^*$ it follows that $\bar{s} \in$ (14.2.28). Thus apart from normalization the
vectors $s$ in (14.2.28) and in $G^*$ are equivalent. The second order conditions that
$s^T W^* s > 0$ for all $s$ in either (14.2.28) or $G^*$ are also equivalent so the lemma is
proved. □
14.3 Exact Penalty Functions


One of the most important applications of NDO is in the area of nonlinear pro- 
gramming through the use of an exact penalty function and is described in this 
section. To simplify the presentation, two basic types of nonlinear programming 
problem are considered, the equality constraint problem 


$$\underset{x}{\text{minimize}}\ f(x) \quad \text{subject to} \quad c(x) = 0 \qquad (14.3.1)$$

for which the corresponding exact penalty function problem is

$$\underset{x}{\text{minimize}}\ \phi(x) \triangleq \nu f(x) + \|c(x)\| \qquad (14.3.2)$$

and the inequality constraint problem

$$\underset{x}{\text{minimize}}\ f(x) \quad \text{subject to} \quad c(x) \le 0 \qquad (14.3.3)$$

for which the corresponding exact penalty function problem is

$$\underset{x}{\text{minimize}}\ \phi(x) \triangleq \nu f(x) + \|c(x)^+\| \qquad (14.3.4)$$

where $c_i^+$ denotes $\max(c_i, 0)$. There is no difficulty in generalizing to the mixed
problem (7.1.1) and this can be done within the framework of (14.3.3) by replacing
$c_i = 0$ by $c_i \le 0$ and $-c_i \le 0$. Note that in (14.3.3) the inequality is in the reverse
direction from that considered in Chapter 9. This is equivalent to a change of sign 
in the constraint residuals $c(x)$ which must be kept in mind. These penalty functions
are exact in the sense of Section 12.5, so that (14.3.1) and (14.3.2) (or
(14.3.3) and (14.3.4)) are equivalent in that for sufficiently small v a local solution 
of (14.3.1) is a local minimizer of (14.3.2), and vice versa. This result is made pre- 
cise in theorems 14.3.1 and 14.3.2 below. Practical considerations in choosing v are 
discussed later in the section. The conditions under which equivalence holds are 
usually satisfied in practice but there are examples when the problems are not 
equivalent. Examples where $x^*$ solves (14.3.3) and not (14.3.4) are given at the
end of theorem 14.3.1. However these examples are either pathological or limiting 
cases and need not greatly concern the user. The alternative possibility is that a 
local minimizer x* of (14.3.2) may not be feasible in (14.3.1), even though the 
latter may have a solution. This is the same situation as that illustrated in Figure 
6.2.1 and described in Section 6.2 (Volume 1): it is an inevitable consequence of 
the use of any penalty approach as a means of inducing global convergence. To 
circumvent the difficulty is a global minimization problem and hence generally 
impracticable. Corresponding advantages are that best solutions can be determined 
when no feasible point exists in (14.3.1) and that the difficulty of finding an initial 
feasible point is avoided. In practice the most likely unfavourable situation which
arises when applying an exact penalty function is that a sequence $\{x^{(k)}\}$ is
calculated such that $\phi^{(k)} \to -\infty$, which is an indication that the calculation should be
repeated with a smaller value of $\nu$. It may be necessary to pre-scale the $c_i(x)$ to be
of comparable magnitude so that the use of a single penalty parameter $\nu$ is
reasonable (see also (14.3.17)). As an illustration the nonlinear programming problem:
min $-x_1 - x_2$ subject to $x_1^2 + x_2^2 \le 1$ with solution $x_1^* = x_2^* = \lambda^* = 1/\sqrt{2}$ is
considered. The corresponding exact penalty function is


$$\phi(x) = \nu(-x_1 - x_2) + \max(x_1^2 + x_2^2 - 1,\ 0) \qquad (14.3.5)$$


for which the threshold value of $\nu$ in theorem 14.3.1 is $\nu < 1/\lambda^* = \sqrt{2}$. The
contours of $\phi(x)$ for $\nu = 1$ are shown in Figure 14.3.1 and it is clear that $x^*$ minimizes
$\phi(x)$. The non-differentiable nature of $\phi(x)$ on the unit circle (broken line) can also
be seen.
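
The role of the threshold on $\nu$ in this illustration can be checked numerically. The following is a minimal sketch, not part of the original text: the function names, the sampling radius and the number of sampled directions are all illustrative assumptions. It evaluates (14.3.5) on a small circle around $x^* = (1/\sqrt{2}, 1/\sqrt{2})^T$ and reports whether any sampled point has a lower value, which is only a sampled indication of local minimality and not a proof.

```python
import math

def phi(x1, x2, nu):
    """Exact penalty function (14.3.5) for min -x1 - x2 s.t. x1^2 + x2^2 <= 1."""
    return nu * (-x1 - x2) + max(x1 * x1 + x2 * x2 - 1.0, 0.0)

def looks_like_local_min(x, nu, radius=1e-3, n_dirs=360):
    """Sample a small circle around x; return False if any sample is lower.

    radius and n_dirs are arbitrary illustrative choices.
    """
    base = phi(x[0], x[1], nu)
    for k in range(n_dirs):
        t = 2.0 * math.pi * k / n_dirs
        trial = phi(x[0] + radius * math.cos(t), x[1] + radius * math.sin(t), nu)
        if trial < base - 1e-12:
            return False
    return True

x_star = (1.0 / math.sqrt(2.0), 1.0 / math.sqrt(2.0))
print(looks_like_local_min(x_star, nu=1.0))  # nu below the threshold sqrt(2)
print(looks_like_local_min(x_star, nu=2.0))  # nu above the threshold sqrt(2)
```

With $\nu = 2$ the penalty term is too weak: moving outwards along $(1, 1)^T$ reduces $\nu f$ faster than the constraint violation grows, so $x^*$ is no longer a minimizer of the penalty function.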

The main attraction in using (14.3.2) or (14.3.4) is that it holds out the possi- 
bility of avoiding the sequential nature of the penalty functions in Sections 12.1 
and 12.2 so that only a single unconstrained minimization calculation is required. 
Unfortunately (even if $\|\cdot\|_2$ is used) (14.3.2) and (14.3.4) are not smooth ($C^1$)
functions so that the many effective techniques for smooth minimization described 
in Volume 1 cannot adequately be used. The study of algorithms for non-smooth 
problems is a relatively recent development and is described in some detail in 
Section 14.4 and some of the algorithms can be used here. Another approach (Han, 
1977; Coleman and Conn, 1980; Mayne, 1980) is to use an algorithm for nonlinear 
programming as a means of generating a direction of search, and to use the exact 
penalty function as the criterion function to be minimized (approximately) in the 
line search. This approach often works well in practice but unfortunately can fail 
(see Fletcher, 1980b) and I think it is important to take into account the non- 
smooth nature of the penalty function when choosing the direction in which to 
search. The discussion of algorithms in Section 14.4 leads to a globally convergent 
algorithm for composite NDO problems (Fletcher, 1980a) which works well when
applied to minimize exact penalty functions (Fletcher, 1980b), and it is this type of
algorithm which I currently favour. Use of the $\|\cdot\|_1$ norm is the most convenient for
reasons given later in this section. An example of the algorithm applied to minimize
(14.3.5) is shown in Section 14.4. The general algorithm is described in more detail
in Section 14.5; it usually converges at a second order rate and is closely related to
the SOLVER method (Section 12.3) when applied to exact penalty functions.

Figure 14.3.1 Contours for the exact penalty function (14.3.5)

In some cases however a second order rate of convergence is not always obtained 
(the Maratos effect — see Section 14.4) and the best way of dealing with this is 
currently the subject of research. 

The theory for exact penalty functions can be set out in a very concise and 
general way. The definition in (14.3.2) allows the use of any norm and in (14.3.4) 
any monotonic norm ($|x| \le |y| \Rightarrow \|x\| \le \|y\|$). This latter condition ensures that
$\|c^+\|$ is a convex function (see Question 14.2) and includes all $L_p$ and scaled $L_p$
norms for $1 \le p \le \infty$. Thus the results of this section are not just restricted to
polyhedral norms. It is convenient to introduce the concept of a dual norm

$$\|u\|_D \triangleq \max_{\|v\| \le 1} u^T v. \qquad (14.3.6)$$

The dual of $\|\cdot\|_1$ is $\|\cdot\|_\infty$, and vice versa, and the $\|\cdot\|_2$ norm is self-dual. Expressions for
the subdifferential of the functions $\|c\|$ and $\|c^+\|$ are given by

$$\partial\|c\| = \{\lambda : \lambda^T c = \|c\|,\ \|\lambda\|_D \le 1\} \qquad (14.3.7)$$

and

$$\partial\|c^+\| = \{\lambda : \lambda^T c = \|c^+\|,\ \lambda \ge 0,\ \|\lambda\|_D \le 1\}. \qquad (14.3.8)$$


The proof of these expressions is sketched out in some detail in Questions 14.4 and 
14.5. Expressions (14.1.12) to (14.1.15) are special cases of (14.3.7) and (14.3.8). 
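
The dual norm definition (14.3.6) can be illustrated for the two polyhedral cases. The sketch below is an illustration only and uses hypothetical function names: because a linear function attains its maximum over a polytope at a vertex, it simply enumerates the vertices of the $L_1$ and $L_\infty$ unit balls. The enumeration for the $L_\infty$ ball grows as $2^n$, so this is suitable for small vectors only.

```python
from itertools import product

def dual_norm_of_l1(u):
    """max u^T v over ||v||_1 <= 1, attained at a vertex +e_i or -e_i."""
    best = float("-inf")
    for i in range(len(u)):
        for sign in (1.0, -1.0):
            best = max(best, sign * u[i])
    return best

def dual_norm_of_linf(u):
    """max u^T v over ||v||_inf <= 1, attained at a vertex with entries +1 or -1."""
    return max(
        sum(ui * vi for ui, vi in zip(u, v))
        for v in product((-1.0, 1.0), repeat=len(u))
    )

u = [3.0, -1.0, 2.0]
print(dual_norm_of_l1(u), max(abs(ui) for ui in u))    # dual of L1 against the L-infinity norm
print(dual_norm_of_linf(u), sum(abs(ui) for ui in u))  # dual of L-infinity against the L1 norm
```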
The main results giving the equivalence between local solutions of (14.3.1) and 
(14.3.2) (or between (14.3.3) and (14.3.4)) can now be given. In fact the latter 
result only is given; the former case is similar (but easier). A point to bear in mind 
is that in these theorems $\lambda^*$ refers to a multiplier vector for the constrained
problem (14.3.3) and $\nu\lambda^*$ is the equivalent multiplier vector for the exact penalty
function problem (14.3.4).


Theorem 14.3.1 


If $\nu < 1/\|\lambda^*\|_D$ and $c^{*+} = 0$ then the second order sufficient conditions at $x^*$ for
problems (14.3.3) and (14.3.4) are equivalent. Therefore if they hold, the fact that
$x^*$ solves (14.3.3) implies that $x^*$ solves (14.3.4), and vice versa.


Proof 


Second order conditions are given for problem (14.3.3) in theorem 9.3.2 and for
problem (14.3.4) in theorem 14.2.3. The first order requirements are that
$g^* + A^*\lambda^* = 0$, $\lambda^* \ge 0$, $\lambda^{*T} c^* = \|c^{*+}\| = 0$ (with a suitable sign change), and
$\nu g^* + A^*(\nu\lambda^*) = 0$, $\nu\lambda^* \in \partial\|c^{*+}\|$ respectively, which are clearly equivalent from
(14.3.8) if $\nu \le 1/\|\lambda^*\|_D$. Next the sets $G^*$ defined in (9.3.11) (with $\|s\|_2 = 1$) and
in (14.2.20) are shown to be equivalent. For convenience refer to these as $G_9$ and
$G_{14}$ respectively. Let $s \in G_{14}$. From (14.2.20) and the above it follows that

$$s^T A^*(\lambda - \nu\lambda^*) \le 0 \quad \forall\, \lambda \in \partial\|c^{*+}\|. \qquad (14.3.9)$$

Let $\mathcal{A}^*$ denote active constraints at $x^*$ in (14.3.3). For $i \in \mathcal{A}^*$ and small $\varepsilon$ it
follows using $\|\nu\lambda^*\|_D < 1$ that $\lambda = \nu\lambda^* + \varepsilon e_i \in \partial\|c^{*+}\|$ either if $\lambda_i^* = 0$ and $\varepsilon \ge 0$
or if $\lambda_i^* > 0$ and $\pm\varepsilon \ge 0$. Hence from (14.3.9),

$$s^T a_i^* \ \begin{cases} = 0 & \text{if } \lambda_i^* > 0 \\ \le 0 & \text{if } \lambda_i^* = 0 \end{cases} \qquad i \in \mathcal{A}^*. \qquad (14.3.10)$$

Thus $s \in G_9$. Conversely let $s \in G_9$. It follows from (14.3.10) that $s^T A^*\lambda^* = 0$ which
also implies that $s^T g^* = 0$ from the first order conditions. Let $\lambda \in \partial\|c^{*+}\|$; then
$\lambda^T c^* = 0$ and $\lambda \ge 0$ imply that if $c_i^* < 0$ then $\lambda_i = 0$ so constraints $i \notin \mathcal{A}^*$ can
be ignored. Otherwise $s^T a_i^* \lambda_i \le 0$ follows from (14.3.10) and hence
$\max_{\lambda \in \partial\|c^{*+}\|} s^T(\nu g^* + A^*\lambda) = 0$. Thus $s \in G_{14}$, and the equivalence of $G_9$ and $G_{14}$ is
shown. Finally conditions (9.3.15) and (14.2.24) are clearly equivalent so the
equivalence of the second order conditions is demonstrated. □


Similar theorems relating the solutions of (14.3.3) and (14.3.4) are given by 
Charalambous (1979) and Han and Mangasarian (1979). 

The requirement that second order conditions hold and that $\nu < 1/\|\lambda^*\|_D$ in
theorem 14.3.1 cannot easily be relaxed, as the following simple examples show.
In all of these, $x^* = 0$ solves (14.3.3) but does not solve (14.3.4) using $\|\cdot\|_1$, for
the reasons given. For the problem: min $x$ subject to $x^2 \le 0$, $x^*$ is not a KT point.
For the problem: min $x^3$ subject to $x^5 \ge 0$, $x^*$ is a KT point and $\lambda^* = 0$ but the
curvature condition (9.3.14) is not strict. For the problem: min $x - \tfrac{1}{2}x^2$ subject to
$0 \le x \le 1$, $x^*$ is a KT point with $\lambda^* = 1$. Then if $\nu = 1$ the condition $\nu \le 1/\|\lambda^*\|_D$
is not strict for the $L_1$ norm. The last example illustrates another fact. The proof of
theorem 14.3.1 shows that $G_9 \subset G_{14}$ without requiring that $\nu < 1/\|\lambda^*\|_D$. (In the
last example $G_9$ is empty whilst $G_{14}$ is the direction $s = -1$.) Thus if $x^*$ satisfies
second order sufficient conditions for (14.3.4) it follows that both $\nu \le 1/\|\lambda^*\|_D$
and $x^*$ satisfies second order sufficient conditions for (14.3.3). Finally if
$\nu > 1/\|\lambda^*\|_D$ then a result going the other way can be proved.


Theorem 14.3.2 


If first order conditions $g^* + A^*\lambda^* = 0$ hold for (14.3.3) and if $\nu > 1/\|\lambda^*\|_D$ then
$x^*$ is not a local minimizer of (14.3.4).


Proof 


Since $\|\nu\lambda^*\|_D > 1$, the vector $\nu\lambda^*$ is not in $\partial\|c^{*+}\|$ and so the vector
$\nu g^* + A^*(\nu\lambda^*) = 0$ is not in $\partial\phi^*$. Hence by (14.2.6) $x^*$ is not a local minimizer of
$\phi(x)$. □
Another important result concerns the regularity assumption in (14.2.21).
If $x^*$, $\lambda^*$ satisfy first order conditions for either (14.3.1) or (14.3.3) (including
feasibility) then under mild assumptions (14.2.21) holds, even though the norm 
may not be polyhedral. Again the result is proved only in the more difficult case. 


Lemma 14.3.1 


If $x^*$, $\lambda^*$ is a KT point for (14.3.3), if the vectors $a_i^*$, $i \in \mathcal{A}^*$, are linearly
independent, and if $\|\nu\lambda^*\|_D < 1$, then $\mathcal{G}^* = G^*$.


Proof 


Let $s \in G^*$. Then as in the proof of theorem 14.3.1, (14.3.10) holds. By the
independence assumption there exists an arc $x(\theta)$ with $x(0) = x^*$ and $\dot{x}(0) = s$, for
which

$$c_i(x(\theta)) = 0 \ \text{ if } \lambda_i^* > 0, \qquad c_i(x(\theta)) \le 0 \ \text{ if } \lambda_i^* = 0, \qquad i \in \mathcal{A}^*$$

for $\theta \ge 0$ and sufficiently small (see the proof of lemma 9.2.2). It follows that
$\|c(x(\theta))^+\| = c(x(\theta))^T\lambda^* = 0$ so $x(\theta)$ is feasible in (14.2.19) and hence $s \in \mathcal{G}^*$. □


In practice there are considerable advantages in using the $L_1$ norm to define
(14.3.1) or (14.3.3). This choice has been researched widely, for example by
Pietrzykowski (1969), Conn (1973), Han (1977), Coleman and Conn (1980),
Fletcher (1980b), and Mayne (1980) in nonlinear programming applications and
by Barrodale (1970) in $L_1$ data fitting problems. The general nonlinear
programming problem (7.1.1) can be written


$$\underset{x}{\text{minimize}}\ f(x) \quad \text{subject to} \quad c_i(x) = 0,\ i \in E, \qquad c_i(x) \le 0,\ i \in I \qquad (14.3.11)$$

with the sign convention of this chapter. The corresponding $L_1$ exact penalty
function is

$$\phi(x) = \nu f(x) + \sum_{i \in E} |c_i(x)| + \sum_{i \in I} \max(c_i(x), 0). \qquad (14.3.12)$$

Figure 14.3.2 Optimality conditions for an $L_1$ exact penalty function (panels: the
penalty term for a $c \le 0$ constraint; the case $0 \le \lambda^* \le 1$, in which $\phi$ is minimized
when $c^* = 0$; the case $\lambda^* > 1$, in which $\phi$ can be reduced by increasing $c$ when $c^* = 0$)


From theorem 14.2.1 and a combination of (14.1.14) and (14.1.15) it follows that 
first order necessary conditions for (14.3.12) are that there exists a Lagrange 
multiplier vector A* such that 


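$$\nu g^* + A^*\lambda^* = 0$$

$$|\lambda_i^*| \le 1, \qquad \lambda_i^* = \operatorname{sign} c_i^* \ \text{ if } c_i^* \ne 0, \qquad i \in E \qquad (14.3.13)$$

$$0 \le \lambda_i^* \le 1, \qquad \lambda_i^* = 1 \ \text{ if } c_i^* > 0, \qquad \lambda_i^* = 0 \ \text{ if } c_i^* < 0, \qquad i \in I.$$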


The differences between this system and the first order conditions for (14.3.11)
(see (9.1.16)) are not great. Infeasible points in (14.3.11) are permitted in (14.3.13)
in which case the corresponding Lagrange multipliers must take the value
$\lambda_i^* = \operatorname{sign} c_i^*$. If $x^*$ is feasible in (14.3.11) however, this difference does not arise.
The parameter $\nu$ occurs, which is equivalent to saying that $\lambda^*$ is the multiplier of
the scaled problem in which $\nu f(x)$ replaces $f(x)$ in (14.3.11). The only other
significant difference between (9.1.16) and (14.3.13) is the requirement that
$|\lambda_i^*| \le 1$ in the latter. A simple interpretation of this is the following. Define
$Z^* = \{i : c_i^* = 0\}$ and perturb the constraints $i \in Z^*$ as in (9.1.9). Now $x^*$ also solves

$$\underset{x}{\text{minimize}}\ \nu f(x) + \sum_{j \notin Z^*} \lambda_j^* c_j(x) \quad \text{subject to} \quad c_i(x) = 0,\ i \in Z^* \qquad (14.3.14)$$

so using (9.1.10) it follows that

$$\frac{d\big(\nu f + \sum_{j \notin Z^*} \lambda_j^* c_j\big)}{d\varepsilon_i} = -\lambda_i^*, \qquad i \in Z^* \qquad (14.3.15)$$

(allowing for the sign change). If in addition $i \in I$, then for increasing $\varepsilon_i$ the
derivative of the remaining terms which make up $\phi(x)$ is just unity. Thus

$$\frac{d\phi^*}{d\varepsilon_i} = -\lambda_i^* + 1, \qquad i \in Z^* \cap I. \qquad (14.3.16)$$

The necessity for $x^*$ to be minimal thus ensures that $\lambda_i^* \le 1$. A similar development
gives $|\lambda_i^*| \le 1$ for $i \in Z^* \cap E$. Thus for example if (14.3.13) holds except that
$\lambda_i^* > 1$ for some $i \in Z^* \cap I$, then there exists a perturbation with $\varepsilon_i > 0$, $\varepsilon_j = 0$,
$j \ne i$, such that the rate of decrease $\lambda_i^*$ in the terms $\nu f + \sum_{j \notin Z^*} \lambda_j^* c_j$ is greater than
the unit rate of increase in the penalty term $\max(c_i, 0)$. The situation is illustrated
in Figure 14.3.2 for a single constraint and $\nu = 1$. Also the requirement
$\nu < 1/\|\lambda^*\|_D$ in theorem 14.3.1 is seen largely as a need to ensure that the scaled
multipliers $\nu\lambda^*$ satisfy the conditions $|\nu\lambda_i^*| \le 1$ (since $\|\cdot\|_D = \|\cdot\|_\infty$ in this case).
A further feature of the multipliers in (14.3.13) is that they can be interpreted as
the optimum variables in a dual problem in the case that $f(x)$ is convex. This result
is set out in Question 14.14.
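
The first order conditions (14.3.13) can be checked directly for the illustrative problem of (14.3.5), which has a single inequality constraint $c(x) = x_1^2 + x_2^2 - 1 \le 0$. The sketch below is an illustration under that assumption; the function name and return convention are hypothetical. It solves $\nu g^* + \lambda a^* = 0$ for the penalty function multiplier at $x^* = (1/\sqrt{2}, 1/\sqrt{2})^T$ and tests the bound $0 \le \lambda \le 1$.

```python
import math

def l1_multiplier_check(nu):
    """Check (14.3.13) at x* = (1/sqrt2, 1/sqrt2) for f(x) = -x1 - x2 and
    the single inequality c(x) = x1^2 + x2^2 - 1 <= 0 (active at x*)."""
    x = (1.0 / math.sqrt(2.0), 1.0 / math.sqrt(2.0))
    g = (-1.0, -1.0)               # gradient of f
    a = (2.0 * x[0], 2.0 * x[1])   # gradient of c at x*
    # Solve nu*g + lam*a = 0; by symmetry both components give the same lam.
    lam = -nu * g[0] / a[0]
    residual = max(abs(nu * g[i] + lam * a[i]) for i in range(2))
    return lam, residual, 0.0 <= lam <= 1.0

for nu in (1.0, 2.0):
    print(nu, l1_multiplier_check(nu))
```

The multiplier found here is $\nu$ times the multiplier $\lambda^* = 1/\sqrt{2}$ of the constrained problem, so the bound is violated once $\nu$ exceeds $\sqrt{2}$, in agreement with the threshold quoted after (14.3.5).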

An important advantage of using the $L_1$ norm is that the discontinuities in
derivative occur along surfaces c;(x) = 0 and the NDO problem can therefore be 
considered as the equivalent smooth problem (14.3.14) which has a very simple 
structure. Also the subproblem (14.5.2) which is solved in algorithm (14.5.6) is 
closely related to the regular QP problem (12.3.13) which arises in the SOLVER 
method. It is readily solved without adding extra variables as described in (10.3.5) 
and (10.3.6). 

Finally some remarks about the practical choice of the parameter $\nu$ are made.
The smaller $\nu$ is, the more $f$ is damped out relative to $\|c\|_1$ and the less accurately
is the solution located in the tangent plane of the zeroed constraints (cf. Figure
12.1.2, $\sigma = 100$). Also when following a 'curved valley' along the discontinuity
$c_i(x) = 0$ by a sequence of line searches, then as $\nu$ decreases the length of the
correction which can be made decreases and hence the total number of line searches
increases. Thus it is advantageous to keep $\nu$ reasonably large, yet less than the
threshold value $1/\|\lambda^*\|_\infty$. Too small a value of $\nu$ is indicated by the multipliers
$\lambda^*$ in (14.3.14) being uniformly small. Then it can be advantageous to increase $\nu$
(by a power of 10 say) so that the active $\lambda_i^*$ are approximately in the range
(0.1, 1). On the other hand, a sequence $\phi^{(k)} \to -\infty$ may occur which can require
a smaller value of $\nu$ or a different starting point to be chosen. If the active $\lambda_i^*$ differ
widely in magnitude this can be an indication that the constraints should be
rescaled. Some algorithms allow this to be done automatically, in which case it can
be better (for example Powell, 1978a) to rewrite (14.3.12) as

$$\phi(x) = f(x) + \sum_{i \in E} \mu_i |c_i(x)| + \sum_{i \in I} \mu_i \max(c_i(x), 0) \qquad (14.3.17)$$


so that the weighting parameter multiplies the constraint functions and their 
relative weight can be adjusted. 


14.4 Algorithms 


This section reviews progress in the development of algorithms for many different 
classes of NDO problem. It happens that similar ideas have been tried in different 
situations and it is convenient to discuss them in a unified way. At present there is 
a considerable amount of interest in these developments and it is not possible to 
say yet what the best approaches are. This is particularly true when curvature 
estimates are updated by quasi-Newton like schemes. However some important 
common features are emerging which this review tries to bring out. 

Many methods are line search methods in which on each iteration a direction of
search $s^{(k)}$ is determined from some model situation, and $x^{(k+1)} = x^{(k)} + \alpha^{(k)} s^{(k)}$
is obtained by choosing $\alpha^{(k)}$ to minimize approximately the objective function
$\phi(x^{(k)} + \alpha s^{(k)})$ along the line (see Section 2.3, Volume 1). A typical line search
algorithm (various have been suggested) operates under similar principles to those
described in Section 2.6 and uses a combination of sectioning and interpolation,
but there are some new features. One is that since $\phi(x)$ may contain a polyhedral
component $h(c(x))$, the interpolating function must also have the same type of
structure. The simplest possibility is to interpolate a one variable function of the
form

$$\psi(\alpha) = q(\alpha) + h(\ell(\alpha)) \qquad (14.4.1)$$

where $q(\alpha)$ is quadratic and the functions $\ell(\alpha)$ are linear. For composite NDO
applications $q$ and $\ell$ can be estimated from information about $f$ and $c$, and $\alpha^{(k)}$ is
then determined by minimizing (14.4.1). For basic NDO, only the values of $\phi$ and
$d\phi/d\alpha$ are known at any point so it must be assumed that $\psi(\alpha)$ has a more simple
structure, for example the max of just two linear functions $\ell_i(\alpha)$. Many other
possibilities exist. Another new feature concerns what acceptability tests to use,
analogous to (2.4.2), (2.4.3), (2.4.7), and (2.4.8) for smooth functions. If the line
search minimum is non-smooth (see Figure 14.4.1) then it is not appropriate to
try to make the directional derivative small, as (2.4.8) does, since such a point may
not exist. Many line searches use a combination of (2.4.2) and (2.4.7). For this
choice the range of acceptable $\alpha$-values is the interval $[a, b]$ in Figure 14.4.1, and
this should be compared with Figure 2.4.1. It can be seen that this line search has
the effect that the acceptable point always overshoots the minimizing value of $\alpha$,
and in fact this may occur by a substantial amount, so as to considerably slow down
the rate of convergence of the algorithm in which it is embedded. I have found it
valuable to use a different test (Fletcher, 1980b) that the line search is terminated
when the predicted reduction based on (14.4.1) is sufficiently small, and in
particular when the predicted reduction on any subinterval is no greater than 0.1 times
the total reduction in $\phi(x)$ so far achieved in the search. This condition has been
found to work well in ensuring that $\alpha^{(k)}$ is close to the minimizing value of $\alpha$. Also
when using a Newton-like method and when $x^{(k)}$ is close to the solution, then the
unit step can be accepted without further searching.
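
The simplest interpolating model mentioned above for basic NDO, the max of just two linear functions, has a closed-form minimizer. The following is a minimal sketch and not the author's line search: the function name and argument order are hypothetical, and it assumes a bracket $a < b$ with a negative slope at $a$ and a positive slope at $b$, so that the two lines cross inside the bracket.

```python
def two_line_minimizer(a, phi_a, d_a, b, phi_b, d_b):
    """Minimize max(l_a, l_b) where l_a(t) = phi_a + d_a*(t - a) and
    l_b(t) = phi_b + d_b*(t - b) are built from values and slopes at a and b.

    Assumes d_a < 0 < d_b, so the minimizer is the crossing point of the lines.
    """
    if not (d_a < 0.0 < d_b):
        raise ValueError("need a downhill slope at a and an uphill slope at b")
    alpha = (phi_b - phi_a + d_a * a - d_b * b) / (d_a - d_b)
    value = phi_a + d_a * (alpha - a)
    return alpha, value

# Data sampled from psi(t) = max(3 - t, 2*t - 3) at t = 0 and t = 4.
alpha, value = two_line_minimizer(0.0, 3.0, -1.0, 4.0, 5.0, 2.0)
print(alpha, value)
```

For a function that really is the max of two linear pieces the estimate is exact; otherwise it serves as the next trial point inside the bracket, to be combined with sectioning.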





Figure 14.4.1 Line search for non-differentiable functions
It is now possible to examine algorithms for NDO problems in many variables.
In basic NDO only a limited amount of information is available at any given point
$x$, namely $\phi(x)$ and one element $g \in \partial\phi(x)$ (usually $\nabla\phi(x)$ since $\phi(x)$ is almost
everywhere differentiable). This makes the basic NDO problem more difficult than
composite NDO in which values of $f$ and $c$ (and their derivatives) are given in
(14.1.1). The simplest method for basic NDO is an analogue of the steepest descent
method in which $s^{(k)} = -g^{(k)}$ is used as a search direction; this is known as
subgradient optimization. Because of its simplicity the method has received much
attention both in theory and in practice (see the references given in Lemarechal
and Mifflin (1978)), but it is at best linearly convergent, as illustrated by the results
for smooth functions (see Section 2.3). In fact the situation is worse in the
non-smooth case because the convergence result (theorem 2.4.1) no longer holds.
Indeed examples of non-convergence are easily constructed. Assume that the line
search is exact and that the subgradient obtained at $x^{(k+1)}$ corresponds to the
piece which is active for $\alpha \ge \alpha^{(k)}$. Then the example

$$\phi(x) = \max_{i = 1, 2, 3} c_i(x), \qquad c_1(x) = -5x_1 + x_2, \quad c_2(x) = x_1^2 + x_2^2 + 4x_2, \quad c_3(x) = 5x_1 + x_2 \qquad (14.4.2)$$

due to Demyanov and Malozemov (1971) illustrates convergence from the initial
point $x^{(1)}$ (see Figure 14.4.2) to the non-optimal point $x^\infty = 0$. The solution is at
$x^* = (0, -3)^T$. At $x^{(1)}, x^{(3)}, \ldots$ the subgradient corresponding to $c_1$ is used and at
$x^{(2)}, x^{(4)}, \ldots$ the subgradient corresponding to $c_3$. It is easily seen from Figure
14.4.2 that the sequence $\{x^{(k)}\}$ oscillates between the two curved surfaces of
non-differentiability and converges to $x^\infty$ which is not optimal. In fact it is not
necessary for the surfaces of non-differentiability to be curved, and a similar polyhedral
(piecewise linear) example can easily be constructed for which the algorithm also
fails.
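
That the origin is not optimal for (14.4.2), whereas $(0, -3)^T$ is, can be confirmed from the directional derivative of a max function, which is the largest slope among the active pieces. The sketch below is an illustration only: it assumes the three pieces $c_1 = -5x_1 + x_2$, $c_2 = x_1^2 + x_2^2 + 4x_2$, $c_3 = 5x_1 + x_2$, and the helper names and the activity tolerance are hypothetical. It does not simulate the subgradient iteration itself.

```python
import math

def pieces(x1, x2):
    """Values of the three pieces of the max function (14.4.2)."""
    return [-5.0 * x1 + x2, x1 * x1 + x2 * x2 + 4.0 * x2, 5.0 * x1 + x2]

def gradients(x1, x2):
    """Gradients of the three pieces, in the same order."""
    return [(-5.0, 1.0), (2.0 * x1, 2.0 * x2 + 4.0), (5.0, 1.0)]

def phi(x1, x2):
    return max(pieces(x1, x2))

def directional_derivative(x, s, tol=1e-9):
    """max of g_i^T s over the pieces that are active at x (within tol)."""
    vals = pieces(*x)
    top = max(vals)
    return max(
        g[0] * s[0] + g[1] * s[1]
        for v, g in zip(vals, gradients(*x))
        if top - v <= tol
    )

# At the origin all three pieces are active and (0, -1) is a descent direction.
print(phi(0.0, 0.0), directional_derivative((0.0, 0.0), (0.0, -1.0)))

# At (0, -3) no sampled direction gives a negative directional derivative.
slopes = [
    directional_derivative((0.0, -3.0), (math.cos(t), math.sin(t)))
    for t in (2.0 * math.pi * k / 360 for k in range(360))
]
print(phi(0.0, -3.0), min(slopes))
```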

It can be argued that a closer analogue to the steepest descent method at a point
of non-differentiability is to search along the steepest descent direction $s^{(k)} = -\bar{g}^{(k)}$
where $\bar{g}^{(k)}$ minimizes $\|g\|_2$ for $g \in \partial\phi^{(k)}$. This interpretation is justified for convex
functions in (14.2.12) and a similar result holds when the composite function $\phi(x)$
is in use. Let $\partial\phi^{(k)}$ be defined by the convex hull of its extreme points, $\partial\phi^{(k)} =
\operatorname{conv}_i g_i^{(k)}$, say. (For composite functions $g_i^{(k)} = g^{(k)} + A^{(k)} h_i$, $i \in \mathcal{A}^{(k)}$; see
(14.2.16).) Then $\bar{g}^{(k)}$ is defined by solving the problem

$$\underset{g}{\text{minimize}}\ g^T g \quad \text{subject to} \quad g = \sum_i \mu_i g_i^{(k)}, \quad \sum_i \mu_i = 1, \quad \mu \ge 0 \qquad (14.4.3)$$

and $s^{(k)} = -\bar{g}^{(k)}$. This problem is similar to a least distance QP problem and is
readily solved by the methods of Section 10.5. The resulting method terminates
finitely when $\phi(x)$ is polyhedral (piecewise linear) and exact line searches are used.
This latter condition ensures that the $x^{(k)}$ are on the surfaces of
non-differentiability for which $\partial\phi^{(k)}$ has more than one element. The algorithm is idealized
however in that exact line searches are not generally possible in practice. Also the
whole subdifferential $\partial\phi^{(k)}$ is not usually available, and even if it were the
non-exact line search would cause $\partial\phi^{(k)} = \nabla\phi^{(k)}$ to hold and the method would revert
to subgradient optimization.

Figure 14.4.2 False convergence of steepest-descent-like methods for
problem (14.4.2)

The spirit of this type of method however is preserved in bundle methods
(Lemarechal, 1978). A bundle method is a line search method which solves
subproblem (14.4.3) to define $s^{(k)}$ except that the vectors $g_i^{(k)}$ are elements of a
'bundle' $B$ rather than the extreme points of $\partial\phi^{(k)}$. Initially $B$ is set to
$B = g^{(1)} \in \partial\phi^{(1)}$ and in the simplest form of the algorithm, subgradients $g^{(2)},
g^{(3)}, \ldots$ are added to $B$ on successive iterations. The method continues in this way
until $0 \in B$. Then the bundle $B$ is reset, for instance to the current $g^{(k)}$, and the
iteration is continued. With careful manipulation of $B$ (see for example the
conjugate subgradient method of Wolfe (1975)) a convergence result for the
algorithm can be proved and a suitable termination test obtained. However Wolfe's
method does not terminate finitely at the solution for polyhedral $\phi(x)$. An example
(due to Powell) has the pentagonal contours illustrated in Figure 14.4.3. At $x^{(1)}$
$B$ is $g_1$, and at $x^{(2)}$ $B$ is $\{g_1, g_3\}$. At $x^{(3)}$ $B$ is $\{g_1, g_3, g_4\}$ and $0 \in B$ so the process
is restarted with $B = g_3$ (or $B = g_4$ with the same conclusions). The whole sequence
is repeated and the points $x^{(k)}$ cycle round the pentagon, approaching the centre
but without ever terminating there.

This non-termination can be corrected by including more elements in $B$ when
restarting (for example both $g_3$ and $g_4$) or alternatively deleting old elements like
$g_1$. This can be regarded as providing a better estimate of $\partial\phi^{(k)}$ by $B$. However it is
by no means certain that this would give a better method: in fact the idealized
steepest descent method itself can fail to converge. Problem (14.4.2) again
illustrates this; this time the search directions are tangential to the surfaces of
non-differentiability and oscillatory convergence to $x^\infty$ occurs. This example illustrates
the need to have information about the subdifferential $\partial\phi^\infty$ (which is lacking in
this example) to avoid the false convergence at $x^\infty$. Generalizations of bundle
methods have therefore been suggested which attempt to enlarge $\partial\phi^{(k)}$ so as to
contain subgradient information at neighbouring points. One possibility is to use
the $\varepsilon$-subdifferential

$$\partial_\varepsilon \phi(x) = \{g : \phi(x + \delta) \ge \phi(x) + g^T\delta - \varepsilon \quad \forall\, \delta\}$$

which contains $\partial\phi(x)$. Bundle methods in which $B$ approximates this set are
discussed by Lemarechal (1978) and are currently being researched.

Figure 14.4.3 Non-termination of a bundle method

A quite different type of method for basic NDO is to use the information
$x^{(k)}$, $\phi^{(k)}$, $g^{(k)}$ obtained at any point to define the linear function

$$\phi^{(k)} + (x - x^{(k)})^T g^{(k)} \qquad (14.4.4)$$

and to use these linearizations in an attempt to model $\phi(x)$. If $\phi(x)$ is convex then
this function is a supporting hyperplane and this fact is exploited in the cutting
plane method. The linearizations are used to define a model polyhedral convex
function and the minimizer of this function determines the next iterate. Specifically
the linear program

$$\underset{x,\,v}{\text{minimize}}\ v \quad \text{subject to} \quad v \ge \phi^{(i)} + (x - x^{(i)})^T g^{(i)}, \quad i = 1, 2, \ldots, k \qquad (14.4.5)$$

is solved to determine $x^{(k+1)}$. Then the linearization determined by $x^{(k+1)}$,
$\phi^{(k+1)}$ and $g^{(k+1)}$ is added to the set and the process is repeated. On the early
iterations a step restriction $\|x - x^{(k)}\|_\infty \le h$ is required to ensure that (14.4.5) is
not unbounded. A line search can also be added to the method. A similar method
can be used on non-convex problems if old linearizations are deleted in a systematic
way.
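
The cutting plane iteration based on (14.4.5) can be sketched in one variable, where the linear program reduces to minimizing a max of lines over an interval. This is an illustration under several assumptions that are not in the original text: the test function $\phi(x) = x^2 + |x - 1|$, the fixed interval $[-2, 3]$ standing in for the step restriction, the stopping tolerance, and all function names. The model minimizer is found by checking the interval end points and all pairwise crossings of the cuts, which is adequate only for a small number of cuts.

```python
def phi(x):
    """Convex non-smooth test function (an assumption for this sketch)."""
    return x * x + abs(x - 1.0)

def subgrad(x):
    """One element of the subdifferential; at the kink x = 1 the right slope."""
    return 2.0 * x + (1.0 if x >= 1.0 else -1.0)

def model_min(cuts, lo, hi):
    """Minimize max_i (s_i*x + t_i) over [lo, hi] by checking the end points
    and every pairwise crossing of the cuts that lies in the interval."""
    def model(x):
        return max(s * x + t for s, t in cuts)
    cands = [lo, hi]
    for i in range(len(cuts)):
        for j in range(i + 1, len(cuts)):
            si, ti = cuts[i]
            sj, tj = cuts[j]
            if si != sj:
                x = (tj - ti) / (si - sj)
                if lo <= x <= hi:
                    cands.append(x)
    best = min(cands, key=model)
    return best, model(best)

def cutting_plane(x0, lo, hi, tol=1e-10, max_iter=60):
    """Add the linearization (14.4.4) at each iterate, then minimize the model."""
    cuts = []
    x = x0
    best_x, best_val = x0, phi(x0)
    lower = float("-inf")
    for _ in range(max_iter):
        g = subgrad(x)
        cuts.append((g, phi(x) - g * x))   # cut stored as slope and intercept
        x, lower = model_min(cuts, lo, hi)
        if phi(x) < best_val:
            best_x, best_val = x, phi(x)
        if best_val - lower <= tol:        # model value is a lower bound when phi is convex
            break
    return best_x, best_val, lower

best_x, best_val, lower = cutting_plane(3.0, -2.0, 3.0)
print(best_x, best_val, lower)
```

For convex $\phi$ every cut under-estimates $\phi$, so the model minimum is a lower bound on the optimal value and the gap between it and the best value found gives a natural termination test.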

More sophisticated algorithms for basic NDO are discussed at the end of this
section and attention is now turned to algorithms for minimizing the composite
function (14.1.1) (which includes max functions, etc., when $h(c)$ is the polyhedral
function (14.1.2); see the special cases in (14.1.3)). Linear approximations have
also been used widely and the simplest method is to replace $c(x)$ by the first order
Taylor series approximation

$$c(x^{(k)} + \delta) \approx \ell^{(k)}(\delta) \triangleq c^{(k)} + A^{(k)T}\delta \qquad (14.4.6)$$

and $f(x)$ (if present) by

$$f(x^{(k)} + \delta) \approx f^{(k)} + g^{(k)T}\delta, \qquad (14.4.7)$$

and to substitute these approximations into (14.1.6) which is an equivalent form of
the original problem. The linear program

$$\underset{\delta,\,v}{\text{minimize}}\ v \quad \text{subject to} \quad v - (g^{(k)} + A^{(k)}h_i)^T\delta \ge f^{(k)} + c^{(k)T}h_i + b_i \quad \forall\, i \qquad (14.4.8)$$

is solved to determine $\delta^{(k)}$ and hence $x^{(k+1)} = x^{(k)} + \delta^{(k)}$. The only difference
between this and (14.4.5) is that here there is sufficient information available at
$x^{(k)}$ to determine linear approximations to all the pieces, whereas in basic NDO this
information has to be accumulated on a sequence of iterations. When applied to
min $\|c(x)\|_\infty$ or min $\|c(x)\|_1$ the iteration based on (14.4.8) can be considered as a
Gauss-Newton method by analogy with the same method for solving min $\|c(x)\|_2^2$
(see Section 6.1, Volume 1) which is based on the same linear approximations. An
early study of this type of method is by Osborne and Watson (1969) and a more
elaborate recent method is given by Charalambous and Conn (1978). As with
Gauss-Newton methods, convergence is not guaranteed but this can likewise be
rectified by going to a restricted step type of method. In this case $\|x - x^{(k)}\|_\infty \le
h^{(k)}$ is a suitable choice since it preserves a linear programming subproblem. This
approach has been investigated by Madsen (1975).
Linear approximation methods are most successful when the linearizations of
the active pieces at $x^*$ fully determine $x^*$ (that is to say when the first order
conditions are sufficient, which occurs when the set $G^*$ in (14.2.20) corresponding
to directions of zero slope is empty and curvature effects are negligible). A
necessary condition for this is that there are $n + 1$ (or more) active pieces at $x^*$. In these
circumstances the rate of convergence of the method based on (14.4.8) is second
order (to prove this, show that the method is ultimately equivalent to the
Newton-Raphson method for solving the active constraints in (14.4.8)). This situation is
most likely to occur in problems such as over-determined $L_1$ or $L_\infty$ data fitting,
where these methods can be very successful.

In general however it is not possible to exclude curvature effects, and when
these are important then methods based only on linear information converge slowly
and become unreliable. However the problem min $\phi(x)$ is equivalent to the
nonlinear program (14.1.6), and it is known that the SOLVER method of Section 12.3
has a second order rate of convergence. Following (12.3.13), and given an
approximation $\mu^{(k)}$ to the optimum multipliers $\mu^*$ in (14.1.8), the QP problem

$$\underset{\delta,\,v}{\text{minimize}}\ v + \tfrac{1}{2}\delta^T\Big(\sum_i \mu_i^{(k)}\big(\nabla^2 f^{(k)} + \nabla^2(c^{(k)T}h_i)\big)\Big)\delta \quad \text{subject to} \quad v - (g^{(k)} + A^{(k)}h_i)^T\delta \ge f^{(k)} + c^{(k)T}h_i + b_i \quad \forall\, i \qquad (14.4.9)$$

is set up. Using $\sum_i \mu_i^{(k)} = 1$, $\lambda^{(k)} = H\mu^{(k)}$, and writing the matrix
$\nabla^2 f^{(k)} + \sum_i \lambda_i^{(k)}\nabla^2 c_i^{(k)} = \nabla^2\mathcal{L}(x^{(k)}, \lambda^{(k)})$ as $W^{(k)}$, (14.4.9) can be written as

$$\underset{\delta,\,v}{\text{minimize}}\ v + \tfrac{1}{2}\delta^T W^{(k)}\delta \quad \text{subject to} \quad v - (g^{(k)} + A^{(k)}h_i)^T\delta \ge f^{(k)} + c^{(k)T}h_i + b_i \quad \forall\, i. \qquad (14.4.10)$$

This is similar to (14.4.8) but contains an extra quadratic term in the objective
function which accounts for curvature in the functions $f$ and $c$. Thus the analogue
of the SOLVER method is to solve (14.4.10) to determine a correction $\delta^{(k)}$ and
then set $x^{(k+1)} = x^{(k)} + \delta^{(k)}$. In addition the multipliers of the constraints in
(14.4.10) become the next approximation $\mu^{(k+1)}$ to be used in constructing
$W^{(k+1)}$. This method is mentioned briefly by Pshenichnyi (1978) in the context
of minimizing max functions.

An important observation is that the above algorithm is closely related to one
in which $\delta^{(k)}$ is chosen to solve

$$\underset{\delta}{\text{minimize}}\ \psi^{(k)}(\delta) \triangleq q^{(k)}(\delta) + h(\ell^{(k)}(\delta)) \qquad (14.4.11)$$

where

$$q^{(k)}(\delta) \triangleq \tfrac{1}{2}\delta^T W^{(k)}\delta + g^{(k)T}\delta + f^{(k)} \qquad (14.4.12)$$

and where $\ell^{(k)}(\delta)$ is defined as in (14.4.6). Then $x^{(k+1)} = x^{(k)} + \delta^{(k)}$ and $\lambda^{(k+1)}$ is
chosen as the multipliers which exist at the solution to (14.4.11). This method is
referred to here as the QL method since it makes both quadratic and linear
approximations. It is of wider application than (14.4.10) since the latter requires that $h(c)$
is polyhedral. However when this occurs then the QL method and that based on
(14.4.10) are equivalent. This is readily shown by setting $w = v + \tfrac{1}{2}\delta^T W^{(k)}\delta$ so that
(14.4.10) becomes

$$\underset{\delta,\,w}{\text{minimize}}\ w \quad \text{subject to} \quad w \ge q^{(k)}(\delta) + h_i^T\ell^{(k)}(\delta) + b_i \quad \forall\, i$$

which is equivalent to (14.4.11) (Fletcher, 1980a). The QL method is seen to have
a nice interpretation. The function $\psi^{(k)}(\delta)$ defined in (14.4.11) approximates the
composite function $\phi(x)$ ($= \phi(x^{(k)} + \delta)$) local to $x^{(k)}$. The approximation is
constructed by replacing $c(x)$ in (14.1.1) by the linear approximation $\ell^{(k)}(\delta)$ and $f(x)$
by the quadratic approximation $q^{(k)}(\delta)$. This contains the matrix $\sum_i \lambda_i^{(k)}\nabla^2 c_i^{(k)}$
which accounts for curvature in the functions $c_i(x)$. This is exactly the way in
which the SOLVER method itself is obtained from a nonlinear programming
problem (Section 12.3).

An illustration of the QL method applied to solve problem (14.3.5) with $\nu = 1$ is
given. Initial approximations $x^{(1)} = (2, 0)^T$ and $\lambda^{(1)} = 1$ are taken and the progress
of the iterations is given in Table 14.4.1. It can be seen that the rate of convergence
is second order with $\|\delta^{(k+1)}\|_2 / \|\delta^{(k)}\|_2^2 \approx 0.5$. The contours of the approximating
function $\psi^{(k)}$ for $k = 1$ and $k = 2$ are illustrated in Figure 14.4.4. Each function has
quadratic pieces with a discontinuous derivative on a linear surface (partly dotted)
which is the linearization (through (14.4.6)) of the unit circle on which the
discontinuity in Figure 14.3.1 occurs.
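
The iteration in this illustration can be sketched in a few lines of Python. Problem (14.3.5) is not restated in this section, so the sketch assumes it has the form $f(x) = -x_1 - x_2$, $c(x) = x_1^2 + x_2^2 - 1$ with $\phi = f + |c|$ when $\nu = 1$; this assumption reproduces the first columns of Table 14.4.1 but should be checked against Section 14.3. The function names (`ql_step`, `run_ql`) are illustrative only, and the step solver assumes that $W^{(k)}$ is positive definite, which holds along this particular run.

```python
import numpy as np

# Assumed form of problem (14.3.5) with nu = 1 (not stated in this section):
#   phi(x) = f(x) + |c(x)|,  f(x) = -x1 - x2,  c(x) = x1^2 + x2^2 - 1.
def f(x): return -x[0] - x[1]
def grad_f(x): return np.array([-1.0, -1.0])
def c(x): return x[0] ** 2 + x[1] ** 2 - 1.0
def grad_c(x): return 2.0 * x
def phi(x): return f(x) + abs(c(x))

def ql_step(x, lam):
    """One unit QL step: minimize q(delta) + |l(delta)| (W assumed positive definite)."""
    W = lam * 2.0 * np.eye(2)            # W = hess f + lam * hess c = 0 + lam * 2I
    g, a, ck = grad_f(x), grad_c(x), c(x)
    # Candidate on the kink l(delta) = c + a^T delta = 0: solve the KKT system
    #   W delta + g + lam_new * a = 0,   a^T delta = -c.
    K = np.block([[W, a[:, None]], [a[None, :], np.zeros((1, 1))]])
    sol = np.linalg.solve(K, np.concatenate([-g, [-ck]]))
    delta, lam_new = sol[:2], float(sol[2])
    if abs(lam_new) > 1.0:
        # Multiplier outside the subdifferential [-1, 1] of |.|: clip it and
        # minimize the corresponding smooth quadratic piece instead.
        lam_new = float(np.sign(lam_new))
        delta = np.linalg.solve(W, -(g + lam_new * a))
    return delta, lam_new

def run_ql(x0, lam0, iters):
    """Return the list of (x, lambda, phi) for the starting point and each iterate."""
    x, lam = np.array(x0, dtype=float), float(lam0)
    history = [(x.copy(), lam, phi(x))]
    for _ in range(iters):
        delta, lam = ql_step(x, lam)
        x = x + delta
        history.append((x.copy(), lam, phi(x)))
    return history

if __name__ == "__main__":
    for k, (x, lam, p) in enumerate(run_ql([2.0, 0.0], 1.0, 5), start=1):
        print(k, x, lam, p)
```

Under the stated assumption the first step is $\delta^{(1)} = (-0.75, 0.5)^T$ with new multiplier 0.625, in agreement with the table. No line search or step restriction is included, so the sketch inherits the lack of robustness of the basic method.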

It is possible under mild conditions to prove that the QL method converges
locally at a second order rate in a variety of circumstances. The most
straightforward case uses the equivalence between (14.4.10) and (14.4.11). It is
assumed that second order sufficient conditions hold at $x^*$, $\mu^*$ for problem (14.1.6)
(or equivalently at $x^*$, $\lambda^*$ for problem (14.1.1); see lemma 14.2.8), that

A* = Wand that the independence assumption (14.2.25) holds. Using the 
result of Question 14.11, these conditions imply the local convergence at a second
order rate of the SOLVER method applied to (14.1.6), that is method (14.4.10),
by virtue of theorem 12.3.1 and Question 12.18. Thus the QL method also
converges locally at a second order rate. The second order sufficient conditions are
almost equivalent to the necessary conditions (see lemma 14.2.7) so this
assumption is likely to be valid in practice. The independence assumption is likely to hold


Table 14.4.1 Application of the QL method to problem (14.3.5)

k                  1       2          3          4          5          6
$x^{(k)}$          2       1.25       0.804310   0.712691   0.707133   0.707107
                   0       0.5        0.801724   0.712981   0.707127   0.707107
$\lambda^{(k)}$    1       0.625      0.622844   0.692599   0.706968   0.707107
$\phi^{(k)}$       1      -0.9375    -1.316358  -1.409402  -1.414194  -1.414214
$\delta^{(k)}$    -0.75   -0.445690  -0.091619  -0.005558  -0.000026   2x10^-10
                   0.5     0.301724  -0.088744  -0.005854  -0.000020  -9x10^-10

Figure 14.4.4 Contours of the approximating function in the QL method


when there are at most $n + 1$ elements in $\mathcal{A}^*$, that is in case (I) and (II) situations
in (14.1.3) and in case (III) when $c^* \ne 0$. However case (IV) and (V) situations are
not included (see the comments after lemma 14.2.7).

Another situation which can be handled is when $h(c)$ is $\|c\|$ (or $\|c^+\|$) and
$c^* = 0$ (or $c^{*+} = 0$). The norm need not be polyhedral; this is essentially the
situation of Section 14.3 that an exact penalty function is being used to solve the
nonlinear programming problem (14.3.1) (or (14.3.3)). Let $\mathcal{A}^*$ refer here to the active
constraints in (14.3.1) or (14.3.3). It is assumed that $\nu\lambda^* \in \operatorname{interior}(\partial h^*)$ (that is
$\|\nu\lambda^*\|_D < 1$ and $\lambda_i^* > 0$, $i \in \mathcal{A}^*$, when $h(c) = \|c^+\|$) and that the vectors $a_i^*$,
$i \in \mathcal{A}^*$, are linearly independent, which are mild assumptions. The argument is
given in the more difficult case that $h(c) = \|c^+\|$. It follows from theorem 14.3.1
that second order conditions for (14.3.3) and (14.3.4) are equivalent and these are
assumed to hold. Let $A^*$ have columns $a_i^*$, $i \in \mathcal{A}^*$, and let $Z^*$ be any basis for the
null column space of $A^*$ as in Section 10.1. Then the second order conditions for
(14.3.3) are equivalent to the condition that $Z^{*T} W^* Z^*$ is positive definite
(Question 11.1). Define $Z^*$ by (10.1.22) (direct elimination) assuming without
loss of generality that $A_1^*$ is non-singular. Then $Z$ is a continuous function of $A$ in
a neighbourhood of $A^*$. Thus for $x^{(k)}$, $\lambda^{(k)}$ in a neighbourhood of $x^*$, $\lambda^*$,
$Z^{(k)T} W^{(k)} Z^{(k)}$ is positive definite (continuity of eigenvalues) and hence as in
Questions 11.1 and 12.18, second order sufficient conditions hold for the problem

\[
\begin{aligned}
&\underset{\delta}{\text{minimize}} && q^{(k)}(\delta) \\
&\text{subject to} && l^{(k)}(\delta) \le 0.
\end{aligned}
\]


This is the subproblem which is solved when the SOLVER method is applied to 
(14.3.3). It follows by theorem 14.3.1 that this subproblem is equivalent to 
(14.4.11) which is the QL method applied to (14.3.4). Hence the QL method is 
equivalent to SOLVER in this case also. Under the same assumptions the SOLVER 
method converges locally at a second order rate so the same is true of the QL
method. Also the assumptions imply that lemma 14.3.1 holds so that the sufficient
conditions are almost necessary and hence likely to be achieved in practice.

This result can also be used to tackle the more difficult case (IV) and (V)
situations in (14.1.3) involving $\|\cdot\|$, when $c^{*+} \ne 0$ or $c^* \ne 0$. Then using a splitting
like (14.2.27) the problem is reduced to that in the previous paragraph, and the
same conclusions can be deduced under the mild assumption that the vectors
$a_i^*$, $i \in Z$, are linearly independent. Thus in most practical cases the QL method
converges locally at a second order rate. A final remark about rates of convergence
is that when $h(c) = \|c\|$ is a smooth norm (for example $\|\cdot\|_2$) and $c^* \ne 0$ then
$\mathcal{G}^* \ne G^*$ and the above development is not appropriate. In this case the most
appropriate second order sufficient conditions are those for a smooth problem that
$\nabla\phi^* = 0$ and $\nabla^2\phi^*$ is positive definite. However the QL method does not reduce to
Newton's method so it is an open question as to whether a second order rate of
convergence can be deduced.

These results on second order convergence make it clear that the QL method is
attractive as a starting point for developing a general purpose algorithm for
composite NDO problems. However the basic method itself is not robust and can fail to
converge (when $m = 0$ it is just the basic Newton's method; see Section 3.1,
Volume 1). An obvious precaution to prevent divergence is to require that the
sequence $\{\phi^{(k)}\}$ is non-increasing. One possibility is to use the correction from
solving (14.4.11) as a direction of search $s^{(k)}$ along which to approximately
minimize the objective function $\phi(x^{(k)} + \alpha s^{(k)})$. Both the basic method and the line
search modification can fail in another way however, in that remote from the
solution the function $\psi^{(k)}(\delta)$ may not have a minimizer. This is analogous to the
case that $G^{(k)}$ is indefinite in smooth unconstrained minimization. An effective
modification in the latter case is the restricted step or trust region approach
(Chapter 5, Volume 1) and it is fruitful to consider modifying the QL method in
this way. Since (14.4.11) contains no side conditions the incorporation of a step
restriction causes no difficulty. This is illustrated in Figure 14.4.4 ($k = 1$) in which
the dotted circle with centre $x^{(1)}$ is a possible trust region. Clearly there is no
difficulty in minimizing $\psi^{(1)}(\delta)$ within this region: the solution is the point on the
periphery of the circle. Contrast this situation with Section 12.3 where a trust
region cannot be used in conjunction with the SOLVER method because there may
not exist feasible points in the resulting problem. Fletcher (1980a) describes a
model restricted step method for solving composite NDO problems which is
globally convergent without the need to assume that vectors $a_i$ are independent or
multipliers $\lambda^{(k)}$ are bounded. This algorithm is described in more detail in Section
14.5 where the global convergence property is justified. Good practical experience
with the method in $L_1$ exact penalty function applications is reported by Fletcher
(1980b), including the solution of test problems on which other methods fail.

There is one adverse feature of the QL method with line search or trust region
modifications, and also of other second order algorithms for NDO, which differs
from experience with similar algorithms for smooth unconstrained optimization.
In the latter case when $x^{(k)}$ is close to $x^*$ then the unit step of the basic method
reduces the objective function so that the line search or trust region modification
is not brought into play and the second order rate of the basic method is observed
(for example theorem 5.1.2). The same situation does not hold in NDO as Maratos
(1978) observes (see also Mayne, 1980, Chamberlain et al., 1980, and Question
14.13). Thus in some well-behaved NDO problems in which second order effects
are significant at the solution, $x^{(k)}$ can be arbitrarily close to $x^*$ and the unit step
of the basic algorithm can fail to reduce $\phi(x)$. The Maratos effect, as it might be
called, thus causes the unit step to be rejected and may obviate a second order rate
of convergence result. The effect is most likely to occur when the jump
discontinuity in derivative is large. In my practical experience with $L_1$ exact penalty functions
however the effect has not been noticeable: this may be due to keeping the
weighting parameter $\nu$ relatively large to reduce ill-conditioning and hence not typical of
all types of application. Further modifications of second order methods are
currently being researched to circumvent this difficulty.
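
The effect can be observed numerically on the situation described in Question 14.13. The following is a minimal sketch, again assuming (since problem (14.3.5) is not restated in this section) that $\phi(x) = \nu f(x) + |c(x)|$ with $f = -x_1 - x_2$ and $c = x_1^2 + x_2^2 - 1$, so that $x^* = (1/\sqrt{2}, 1/\sqrt{2})^T$ and the optimal multiplier of the composite problem is $\nu/\sqrt{2}$. The function name `unit_step_change` and the offset `eps` are illustrative. Starting on the unit circle at an angular distance `eps` from $x^*$, the unit QL step moves along the tangent, and under these assumptions a short calculation gives a change in $\phi$ of $\sin^2(\text{eps})\,(1 - \sqrt{2}\,\nu)$, which is positive when $\nu < 1/\sqrt{2}$.

```python
import numpy as np

def unit_step_change(nu, eps=1e-2):
    """phi(x + delta) - phi(x) for the unit QL step from a point on the unit circle.

    Assumed problem: phi = nu*f + |c|, f = -x1 - x2, c = x1^2 + x2^2 - 1.
    """
    theta = np.pi / 4 + eps                      # x* lies at theta = pi/4
    x = np.array([np.cos(theta), np.sin(theta)])
    g = nu * np.array([-1.0, -1.0])              # gradient of nu*f
    a = 2.0 * x                                  # gradient of c
    ck = x @ x - 1.0                             # zero up to rounding error
    lam = nu / np.sqrt(2.0)                      # optimal multiplier, used to build W
    W = 2.0 * lam * np.eye(2)                    # nu*hess f + lam*hess c = lam * 2I
    # Solution on the kink l(delta) = 0 (its multiplier is about nu/sqrt(2) <= 1 here).
    K = np.block([[W, a[:, None]], [a[None, :], np.zeros((1, 1))]])
    delta = np.linalg.solve(K, np.concatenate([-g, [-ck]]))[:2]
    phi = lambda z: nu * (-z[0] - z[1]) + abs(z @ z - 1.0)
    return phi(x + delta) - phi(x)

if __name__ == "__main__":
    for nu in (0.5, 1.0):
        print(nu, unit_step_change(nu))
```

With `nu = 0.5` the returned change is positive, so the unit step is rejected however small `eps` is made, whereas with `nu = 1.0` the change is negative and the step is accepted.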
Another disadvantage of the QL method is that it requires second derivative
matrices $\nabla^2 f$ and $\nabla^2 c_i$ to be supplied by the user. However there is no difficulty
in approximating $W^{(k)}$, for instance by finite differences or a quasi-Newton update,
so that only first derivative information is required. One possibility is to use the
modified BFGS formula given by Powell (1978a) to update an approximation
$B^{(k)}$ to $W^{(k)}$ as described earlier in (12.3.18). In view of the superlinear convergence
result for nonlinear programming applications the same is likely to hold here,
assuming that ultimately the unit step $x^{(k+1)} = x^{(k)} + \delta^{(k)}$ is taken on every
iteration. However the feature of forcing $B^{(k)}$ to be positive definite when $W^*$ may
not be does seem artificial. Alternative approaches which avoid this, and which
may require much less storage on large problems, keep an approximation $B^{(k)}$ to
the reduced curvature matrix $Z^T W Z$ (see (11.1.7) and (10.1.3)) appropriate to
the problem (14.4.11). Womersley (1981) uses the BFGS formula (unmodified)
to update $B^{(k)}$ and Conn (1979) obtains $B^{(k)}$ by a finite difference process.
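
As an indication of what a modified BFGS update of this kind involves, the sketch below implements the damped BFGS update that is commonly attributed to Powell (1978). Formula (12.3.18) is not reproduced in this section, so treating it as this damped update, with the customary constants 0.2 and 0.8, is an assumption; the function name `damped_bfgs_update` is illustrative. Here `delta` is the step and `gamma` is the corresponding change in the gradient of the Lagrangian function.

```python
import numpy as np

def damped_bfgs_update(B, delta, gamma):
    """Damped BFGS update of a positive definite matrix B (assumed form of the modified formula).

    delta : step x(k+1) - x(k)
    gamma : change in the gradient of the Lagrangian function over the step
    """
    Bd = B @ delta
    dBd = delta @ Bd                      # positive because B is positive definite
    dg = delta @ gamma
    # Use gamma unchanged when the curvature along delta is sufficiently positive,
    # otherwise blend it with B @ delta so that delta^T eta = 0.2 * delta^T B delta.
    theta = 1.0 if dg >= 0.2 * dBd else 0.8 * dBd / (dBd - dg)
    eta = theta * gamma + (1.0 - theta) * Bd
    return B - np.outer(Bd, Bd) / dBd + np.outer(eta, eta) / (delta @ eta)

if __name__ == "__main__":
    B = np.eye(2)
    # Negative curvature along delta: the damping keeps the update positive definite.
    print(damped_bfgs_update(B, np.array([1.0, 0.0]), np.array([-1.0, 0.0])))
```

The damping is what forces positive definiteness even when the true curvature is negative, which is the feature described above as artificial.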

Methods which use curvature information appear to be much more efficient in
practice than those which do not, when second order effects are important.
Womersley (1981) gives a comparison on a number of test problems. Typically
for the exact penalty function determined by the TP2 problem (Section 12.3),
Charalambous and Conn (1978) require 281 gradient evaluations whereas Womersley
(1981) requires 48. Because of such results there is currently considerable research
interest in second order methods, as shown by the number of presentations at the
recent X Symposium on Mathematical Programming (August 1979) in Montreal.

Techniques based on the QL method are only appropriate for composite NDO
problems, and it is also important to consider applications to basic NDO in which
second order information has been used. Womersley (1978) describes a method
which requires $\phi$, $\nabla\phi$, and $\nabla^2\phi$ to be available for any $x^{(k)}$, $k = 1, 2, \ldots$,
and each such set of information is assumed to arise from an active piece of $\phi(x)$.
This information is used to build up a linear approximation for each piece valid at
$x^{(k)}$ and also to give the matrix $W^{(k)}$. This is then used as in (14.4.10).
Information about pieces may be rejected if the Lagrange multipliers $\mu^{(k+1)}$ determine
that the piece is no longer active. Also a Levenberg-Marquardt modification
(Section 5.2 of Volume 1) is made to $W^{(k)}$ to ensure positive definiteness. A
difficulty with the algorithm is that repeated approximations to the same piece tend to
be collected, which cause degeneracy in the QP solver close to the solution. A
modification is proposed in which an extra item of information is supplied at each $x$
which is an integer label of the active piece. Such information can be supplied in
most basic NDO applications. This labelling enables a single most recent
approximation to each piece to be maintained and circumvents the degeneracy problem.
Again this approach is considerably more effective than the subgradient or cutting
plane methods as numerical evidence given by Womersley suggests (~20 iterations
as against ~300 on a typical problem). A quasi-Newton version of the algorithm is
currently being researched as described in Womersley (1981).


14.5 A Globally Convergent Model Algorithm 


The aim of this section is to show that the QL method based on solving the
subproblem (14.4.11) can be incorporated readily with the idea of a step restriction
which is known to give good numerical results in smooth unconstrained
optimization (Chapter 5, Volume 1). Subproblem (14.4.11) contains curvature
information and potentially allows convergence at a second order rate, whereas the step
length restriction is shown to ensure global convergence. The resulting method
(Fletcher, 1980a) is applicable to the solution of all composite NDO problems,
including exact penalty functions, best $L_1$ and $L_\infty$ approximation, etc., but
excluding basic NDO problems. The term 'model algorithm' indicates that the
algorithm is presented in a simple format as a means of making its convergence
properties clear. It admits of the algorithm being modified to improve its practical
performance whilst not detracting from these properties. The motivation for using
a step restriction is that it defines a trust region for the correction $\delta$ by

\[
\|\delta\| \le h^{(k)}
\tag{14.5.1}
\]


in which the Taylor series approximations (14.4.6) and (14.4.12) are assumed to be
adequate. The norm in (14.5.1) is arbitrary but either the $\|\cdot\|_\infty$ or the $\|\cdot\|_2$ is the
most likely choice, especially the former since (14.5.2) below can then be solved by
QP-like methods (see (10.3.5)). On each iteration the subproblem is to minimize
(14.4.11) subject to this restriction, that is

\[
\begin{aligned}
&\underset{\delta}{\text{minimize}} && \psi^{(k)}(\delta) \\
&\text{subject to} && \|\delta\| \le h^{(k)}.
\end{aligned}
\tag{14.5.2}
\]


The radius $h^{(k)}$ of the trust region is adjusted adaptively to be as large as possible
subject to adequate agreement between $\phi(x^{(k)} + \delta)$ and $\psi^{(k)}(\delta)$ being maintained.
This can be quantified by defining the actual reduction

\[
\Delta\phi^{(k)} = \phi(x^{(k)}) - \phi(x^{(k)} + \delta^{(k)})
\tag{14.5.3}
\]

and the predicted reduction

\[
\Delta\psi^{(k)} = \phi(x^{(k)}) - \psi^{(k)}(\delta^{(k)}).
\tag{14.5.4}
\]

Then the ratio

\[
r^{(k)} = \frac{\Delta\phi^{(k)}}{\Delta\psi^{(k)}}
\tag{14.5.5}
\]

measures the extent to which $\phi$ and $\psi^{(k)}$ agree local to $x^{(k)}$. The rules for changing
$h^{(k)}$ in the model algorithm are those given in Section 5.1 and are not elaborated
on further, except to emphasize that in practice the rule for reducing $h^{(k)}$ can be
more elaborate, based perhaps on some sort of interpolation (Fletcher, 1980b).
The $k$th iteration of the model algorithm is as follows.


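(i) Given $x^{(k)}$, $\lambda^{(k)}$, and $h^{(k)}$, calculate $f^{(k)}$, $g^{(k)}$, $c^{(k)}$, $A^{(k)}$, and
    $W^{(k)}$ which determine $\phi^{(k)}$ and $\psi^{(k)}(\delta)$.
(ii) Find a global solution $\delta^{(k)}$ to (14.5.2).
(iii) Evaluate $\phi(x^{(k)} + \delta^{(k)})$ and calculate $\Delta\phi^{(k)}$, $\Delta\psi^{(k)}$, and $r^{(k)}$.
(iv) If $r^{(k)} < 0.25$ set $h^{(k+1)} = \|\delta^{(k)}\|/4$,
     if $r^{(k)} > 0.75$ and $\|\delta^{(k)}\| = h^{(k)}$ set $h^{(k+1)} = 2h^{(k)}$,
     otherwise set $h^{(k+1)} = h^{(k)}$.
(v) If $r^{(k)} \le 0$ set $x^{(k+1)} = x^{(k)}$, $\lambda^{(k+1)} = \lambda^{(k)}$,
    else $x^{(k+1)} = x^{(k)} + \delta^{(k)}$, $\lambda^{(k+1)}$ = multipliers from (14.5.2).

                                                                  (14.5.6)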


The parameters 0.25, 0.75, etc., which arise are arbitrary and are not very sensitive
but the values given here are typical. In fact the solution of problems like (14.5.2)
has not been considered in this chapter but it is a straightforward extension of these
ideas (see Watson, 1978, Fletcher and Watson, 1980) and multipliers
$\lambda^{(k+1)} \in \partial h(l^{(k)}(\delta^{(k)}))$ exist at the solution in an analogous way to theorem 14.2.1. It is
these multipliers that are used in step (v).
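
Steps (iii) to (v) of (14.5.6) translate directly into code. The following minimal sketch shows only the bookkeeping for the ratio, the radius, and the acceptance test; the solution of subproblem (14.5.2) is not included. The function names and the tolerance `tol` used to decide whether the step lies on the boundary of the trust region are illustrative choices, not part of the model algorithm.

```python
def reduction_ratio(phi_old, phi_new, psi_new):
    """r = actual reduction / predicted reduction, as in (14.5.3) to (14.5.5).

    The predicted reduction is assumed non-zero; if it is zero the
    iteration terminates and this function should not be called.
    """
    actual = phi_old - phi_new          # (14.5.3)
    predicted = phi_old - psi_new       # (14.5.4)
    return actual / predicted           # (14.5.5)

def update_radius(r, step_norm, h, tol=1e-12):
    """Step (iv): return the next trust region radius."""
    if r < 0.25:
        return step_norm / 4.0
    on_boundary = abs(step_norm - h) <= tol * max(1.0, h)
    if r > 0.75 and on_boundary:
        return 2.0 * h
    return h

def accept_step(r):
    """Step (v): the step is taken only if phi is actually reduced (r > 0)."""
    return r > 0.0

if __name__ == "__main__":
    r = reduction_ratio(phi_old=1.0, phi_new=0.0, psi_new=-1.0)
    print(r, update_radius(r, step_norm=1.0, h=1.0), accept_step(r))
```

Keeping these rules separate from the subproblem solver reflects the remark that the model algorithm can be modified, for instance by a more elaborate rule for reducing the radius, without affecting the rest of the iteration.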

In proving global convergence, a result is used relating the directional derivative
(14.2.14) of the composite function (14.2.13) and the difference quotient between
two points $x^{(k)}$ and $x^{(k)} + \epsilon^{(k)} s$ in a common direction $s$ ($\ne 0$), as both points
approach a fixed point $x'$. The result is a special case of one due to Clarke (1975)
but is proved here directly for completeness.


Lemma 14.5.1 


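Let $S$ be the set of all sequences $x^{(k)} \to x'$, $\epsilon^{(k)} \downarrow 0$, and let $c^{(k)} \triangleq c(x^{(k)})$, $c' \triangleq c(x')$,
and $c_\epsilon^{(k)} \triangleq c(x^{(k)} + \epsilon^{(k)} s)$; then

\[
\limsup_{S} \frac{h(c_\epsilon^{(k)}) - h(c^{(k)})}{\epsilon^{(k)}} = \max_{\lambda \in \partial h'} s^T A' \lambda
\tag{14.5.7}
\]

in the sense that the difference quotient is bounded above and the sup of all
accumulation points of the quotient taken over all sequences in $S$ is the directional
derivative $\max_{\lambda \in \partial h'} s^T A' \lambda$.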


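Proof
By the integral form of the Taylor series

\[
c_\epsilon^{(k)} = c^{(k)} + \epsilon^{(k)} \int_0^1 [A(x^{(k)} + \theta \epsilon^{(k)} s)]^T s \, d\theta = c^{(k)} + \epsilon^{(k)} d^{(k)}
\tag{14.5.8}
\]

say, where $d^{(k)} \to A'^T s$ as $k \to \infty$. Let $\lambda_\epsilon^{(k)} \in \partial h_\epsilon^{(k)} \triangleq \partial h(c_\epsilon^{(k)})$. Using the subgradient
inequality and (14.5.8),

\[
h(c^{(k)}) \ge h(c_\epsilon^{(k)}) + (c^{(k)} - c_\epsilon^{(k)})^T \lambda_\epsilon^{(k)} = h(c_\epsilon^{(k)}) - \epsilon^{(k)} d^{(k)T} \lambda_\epsilon^{(k)}
\]

or

\[
d^{(k)T} \lambda_\epsilon^{(k)} \ge \frac{h(c_\epsilon^{(k)}) - h(c^{(k)})}{\epsilon^{(k)}}.
\tag{14.5.9}
\]

Since $\partial h_\epsilon^{(k)}$ is bounded in a neighbourhood of $x'$ (lemma 14.2.1) the difference
quotient is bounded above. Now consider a subsequence for which the difference
quotient accumulates, and let

\[
\lim \frac{h(c_\epsilon^{(k)}) - h(c^{(k)})}{\epsilon^{(k)}} > \max_{\lambda \in \partial h'} s^T A' \lambda.
\tag{14.5.10}
\]

Since $\lambda_\epsilon^{(k)}$ is bounded there exists a thinner subsequence for which $\lambda_\epsilon^{(k)} \to \lambda'$ and
since $c_\epsilon^{(k)} \to c'$ it follows by lemma 14.2.4 that $\lambda' \in \partial h'$. Thus from (14.5.9),

\[
\lim \frac{h(c_\epsilon^{(k)}) - h(c^{(k)})}{\epsilon^{(k)}} \le s^T A' \lambda' \le \max_{\lambda \in \partial h'} s^T A' \lambda
\tag{14.5.11}
\]

which contradicts (14.5.10), so that the reverse inequality ($\le$) in (14.5.10) is true.
Finally by taking $x^{(k)} = x'$, (14.2.14) shows that there is a sequence in $S$ which
achieves equality with the directional derivative. Thus (14.5.7) is established. □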


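Corollary
Define $\phi(x)$ by (14.2.13) where $f \in \mathbb{C}^1$. Then

\[
\limsup_{S} \frac{\phi(x^{(k)} + \epsilon^{(k)} s) - \phi(x^{(k)})}{\epsilon^{(k)}} = \max_{\lambda \in \partial h'} s^T (g' + A' \lambda).
\tag{14.5.12}
\]

Proof
The result follows by using an analogous Taylor series for $f(x)$ as in the proof of the
lemma. □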
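
The content of the corollary can be checked numerically on a small example. The sketch below uses an illustrative composite function chosen here (it is not taken from the text): $\phi(x) = f(x) + |c(x)|$ with $f = -x_1 - x_2$ and $c = x_1^2 + x_2^2 - 1$, for which $\partial h = [-1, 1]$ when $c = 0$, so the directional derivative at such a point is $s^T g + |s^T a|$. The function names are hypothetical. Taking $x^{(k)} = x'$ attains the directional derivative, while a sequence approaching $x'$ from one side of the kink gives a smaller quotient, which is why the result is stated as a lim sup.

```python
import numpy as np

def phi(x):
    """Illustrative composite function f(x) + |c(x)|."""
    return -x[0] - x[1] + abs(x[0] ** 2 + x[1] ** 2 - 1.0)

def directional_derivative(x, s):
    """max over lam in the subdifferential of |.| at c(x) of s^T (g + a * lam)."""
    g = np.array([-1.0, -1.0])
    a = 2.0 * x
    cx = x[0] ** 2 + x[1] ** 2 - 1.0
    if cx == 0.0:                        # on the kink: lam ranges over [-1, 1]
        return s @ g + abs(s @ a)
    return s @ g + np.sign(cx) * (s @ a) # smooth point: lam = sign(c)

def difference_quotient(x, s, eps):
    return (phi(x + eps * s) - phi(x)) / eps

if __name__ == "__main__":
    x_prime = np.array([1.0, 0.0])       # c(x') = 0
    s = np.array([-1.0, 0.0])
    eps = 1e-6
    print(directional_derivative(x_prime, s))            # value on the kink
    print(difference_quotient(x_prime, s, eps))          # x(k) = x'
    print(difference_quotient(x_prime - 2 * eps * s, s, eps))  # x(k) -> x' from one side
```

For this direction the first two printed values agree to within `eps`, and the third is close to the one-sided slope $s^T(g + a)$, which is strictly smaller.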


It is now possible to state the main result of this section. 


Theorem 14.5.1 


Let $x^{(k)} \in B \subset \mathbb{R}^n$ $\forall k$ where $B$ is bounded and let $f$, $c$ be $\mathbb{C}^2$ functions whose
second derivative matrices are bounded on $B$. Then there exists an accumulation
point $x^\infty$ of algorithm (14.5.6) at which first order conditions hold, that is

\[
\max_{\lambda \in \partial h^\infty} s^T (g^\infty + A^\infty \lambda) \ge 0 \quad \forall s.
\tag{14.5.13}
\]

Proof
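There exists a convergent subsequence $x^{(k)} \to x^\infty$ for which either

(i) $r^{(k)} < 0.25$, $h^{(k+1)} \to 0$, and hence $\|\delta^{(k)}\| \to 0$, or
(ii) $r^{(k)} \ge 0.25$ and $\inf h^{(k)} > 0$.

In either case (14.5.13) is shown to hold. In case (i) suppose that there exists a
descent direction $s$ ($\|s\| = 1$) at $x^\infty$, that is

\[
\max_{\lambda \in \partial h^\infty} s^T (g^\infty + A^\infty \lambda) = -d, \qquad d > 0.
\tag{14.5.14}
\]

By Taylor series

\[
f(x^{(k)} + \epsilon^{(k)} s) = q^{(k)}(\epsilon^{(k)} s) + o(\epsilon^{(k)})
\tag{14.5.15}
\]

by (14.4.12), since $\lambda^{(k)}$ is bounded by lemma 14.2.1 and $\nabla^2 f$, $\nabla^2 c_i$ are bounded
by assumption. Likewise by (14.4.6),

\[
c(x^{(k)} + \epsilon^{(k)} s) = l^{(k)}(\epsilon^{(k)} s) + o(\epsilon^{(k)})
\tag{14.5.16}
\]

and hence by (14.1.2), the boundedness of $\partial h$, and (14.4.11), it follows that

\[
\begin{aligned}
\phi(x^{(k)} + \epsilon^{(k)} s) &= q^{(k)}(\epsilon^{(k)} s) + h\big(l^{(k)}(\epsilon^{(k)} s) + o(\epsilon^{(k)})\big) + o(\epsilon^{(k)}) \\
&= q^{(k)}(\epsilon^{(k)} s) + h\big(l^{(k)}(\epsilon^{(k)} s)\big) + o(\epsilon^{(k)}) \\
&= \psi^{(k)}(\epsilon^{(k)} s) + o(\epsilon^{(k)}).
\end{aligned}
\tag{14.5.17}
\]

Writing $\epsilon^{(k)} = \|\delta^{(k)}\|$ and considering a step along $s$ in the subproblem, it follows
by the optimality of $\delta^{(k)}$ that

\[
\begin{aligned}
\Delta\psi^{(k)} &\ge \phi^{(k)} - \psi^{(k)}(\epsilon^{(k)} s) \\
&= \phi^{(k)} - \phi(x^{(k)} + \epsilon^{(k)} s) + o(\epsilon^{(k)}) \\
&\ge \epsilon^{(k)}(d + o(1)) + o(\epsilon^{(k)}) = d\epsilon^{(k)} + o(\epsilon^{(k)})
\end{aligned}
\tag{14.5.18}
\]

by the corollary to lemma 14.5.1 and (14.5.14). But (14.5.17) implies that

\[
\Delta\phi^{(k)} = \Delta\psi^{(k)} + o(\epsilon^{(k)})
\]

and hence $r^{(k)} = \Delta\phi^{(k)}/\Delta\psi^{(k)} = 1 + o(\epsilon^{(k)})/\Delta\psi^{(k)} = 1 + o(1)$ from (14.5.18) since
$d > 0$, which contradicts $r^{(k)} < 0.25$. Thus $d \le 0$ for all $s$ and hence (14.5.13) holds
at $x^\infty$.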

In case (ii) the argument of case (ii) of theorem 5.1.1 (Volume 1) is largely
followed (see also Fletcher, 1980a). The only extra step is to deduce from $x^{(k)} \in B$
that $\partial h^{(k)}$ is uniformly bounded for all $k$, so that the parameters $\lambda^{(k)}$ are bounded.
Thus a thinner subsequence can be chosen such that $\lambda^{(k)} \to \lambda^\infty$ and hence
$W^{(k)} \to W^\infty$. Then functions $q^\infty(\delta)$, $l^\infty(\delta)$, and $\psi^\infty(\delta)$ are defined and it is
concluded as in theorem 5.1.1 that $\delta = 0$ minimizes $\psi^\infty(\delta)$. It follows that the first
order conditions (14.5.13) hold at $x^\infty$. In case (ii) it is also possible to conclude
that second order conditions hold. □

Note that the existence of a bounded region $B$ which the theorem requires is
implied if any level set $\{x : \phi(x) \le \phi^{(k)}\}$ is bounded. Also the theorem assumes that
the sequence $\{x^{(k)}\}$ is infinite; if not then $\Delta\psi^{(k)} = 0$ for some $k$, the iteration
terminates, and first order conditions are satisfied.

One point to emphasize about the theorem is that there are no hidden
assumptions that certain vectors $a_i$ are linearly independent or that the multipliers $\lambda^{(k)}$
are bounded. Methods for NDO or nonlinear programming can often be proved to
be convergent under such assumptions, yet can fail in practice. Thus it is important
that this theorem avoids such assumptions. Another point is that $W^{(k)}$ does not
need to be defined as in (14.4.10) but can be any bounded matrix. Thus the
theorem indicates that a corresponding quasi-Newton method, using for example
$B^{(k)}$ as defined in (12.3.18) in place of $W^{(k)}$, can only fail if $B^{(k)}$ becomes
unbounded. The theorem also subsumes a result due to Madsen (1975) for $L_\infty$
approximation in a first order method like (14.4.8) for which $W^{(k)} = 0$.

For smooth unconstrained optimization a stronger result (theorem 5.1.1) can
be proved that the accumulation point also satisfies second order necessary
conditions. To prove this for algorithm (14.5.6) would require the expression

\[
\phi(x^{(k)} + \epsilon^{(k)} s) = \psi^{(k)}(\epsilon^{(k)} s) + o(\epsilon^{(k)2})
\]

which however does not hold in NDO applications. Nonetheless when $\inf h^{(k)} > 0$,
then second order necessary conditions do hold (see the proof of theorem 14.5.1).
Also for smooth problems, and when second order sufficient conditions hold, the
rate of convergence can be shown to be second order (theorem 5.1.2). This result
is unlikely to hold in general for algorithm (14.5.6) in an NDO application because
of the possibility of the Maratos effect (Section 14.4). However if an assumption
that $\inf h^{(k)} > 0$ is made, then the unit step of the QL method is taken for $k$
sufficiently large, in which case the second order rate of convergence results
described in Section 14.4 are valid. Current research is aimed at modifications of
algorithm (14.5.6) which avoid the Maratos effect and enable the second order rate
to be established more generally (see the Note in proof on p. 214).


Questions for Chapter 14 


1. For the nonlinear program (14.1.6) prove that $\mathcal{F}' = F'$ (defined in Section
   9.2). Let $(s^T, s_{n+1})^T$ be a feasible direction, and use the linearized
   constraint equations to show that

   \[
   s_{n+1} \ge \max_{i \in \mathcal{A}'} s^T (g' + A' h_i). \qquad \text{(a)}
   \]

   Let $\delta^{(k)} \downarrow 0$ be any sequence, and define $x^{(k)} = x' + \delta^{(k)} s$. If equality holds in
   (a), and if $v^{(k)} = \phi(x^{(k)})$, show that $x^{(k)}$, $v^{(k)}$ gives a feasible directional
   sequence in (14.1.6) for sufficiently large $k$. If strict inequality holds in (a),
   and if $v^{(k)} = v' + \delta^{(k)} s_{n+1}$, again show that $x^{(k)}$, $v^{(k)}$ is a feasible directional
   sequence.

2. Prove that $\|c^+\|$ is a convex function of $c$ when the norm is monotonic
   ($|x| \le |y| \Rightarrow \|x\| \le \|y\|$). Deduce that $c_\theta^+ \le (1 - \theta)c_0^+ + \theta c_1^+$ as an
   intermediate stage.


3. Establish the equivalence between each of (14.1.11) to (14.1.15) and the
   general expression $\partial h(c) = \operatorname{conv}_{i \in \mathcal{A}} h_i$. In (14.1.12) let $\lambda_i = \mu_i$, $i \le m$, and
   $\mu_{m+1}$ acts as the slack variable for $\sum_i \lambda_i \le 1$. In (14.1.13) define $\lambda_i = \mu_i - \mu_{m+i}$
   or $\mu_i = \max(\lambda_i, 0)$, $\mu_{m+i} = \max(-\lambda_i, 0)$. In (14.1.14) use the fact that the
   cube $0 \le \lambda_i \le 1$ has extreme points which are all combinations of 1 and 0, and
   similarly for (14.1.15).


4. Establish the equivalence between the subdifferential expression

   \[
   \partial\|c\| = \{\lambda : \|c + h\| \ge \|c\| + \lambda^T h \quad \forall h\} \qquad \text{(b)}
   \]

   and (14.3.7). Use the generalized Cauchy inequality $a^T b \le \|a\| \, \|b\|_D$ on
   $(c + h)^T \lambda$ to show that $\lambda \in$ (14.3.7) implies $\lambda \in$ (b). If $\lambda \in$ (b), use the triangle
   inequality to show that $h^T \lambda \le \|h\|$ $\forall h$ and (14.3.4) to show $\|\lambda\|_D \le 1$.
   Hence $\lambda^T c \le \|c\|$. Then with $h = -c$ in (b) show that $\lambda^T c \ge \|c\|$ and hence
   $\lambda \in$ (14.3.7).


5. If the norm is monotonic (see Question 14.2) establish the equivalence
   between

   \[
   \partial\|c^+\| = \{\lambda : \|(c + h)^+\| \ge \|c^+\| + \lambda^T h \quad \forall h\} \qquad \text{(c)}
   \]

   and (14.3.8). The proof is similar to that in Question 14.4. In the first part also
   use the fact that $\lambda \ge 0$ implies $(c + h)^{+T} \lambda \ge (c + h)^T \lambda$. In the second part also
   use the monotonic norm property to establish $\|h^+\| \ge h^T \lambda$ $\forall h$. Then $h = -e_i$
   yields $\lambda_i \ge 0$. Use $h^+ \le |h|$ to show $\|\lambda\|_D \le 1$. Then proceed much as in
   Question 14.4.


6. Show that the subdifferential in (14.2.2) is a closed convex set.


7. Justify equation (14.2.12) when $0 \notin \partial f'$. Show by straightforward arguments
   that

   \[
   \min_{\|s\|_2 = 1} \max_{g \in \partial f'} s^T g \ge \max_{g \in \partial f'} \min_{\|s\|_2 = 1} s^T g = -\|\bar{g}\|_2,
   \]

   where $\bar{g}$ denotes the element of $\partial f'$ of least norm.
   Then use the separating hyperplane result (lemma 14.2.3) to show that equality
   is achieved when $s = -\bar{g}/\|\bar{g}\|_2$. Show by considering $f = |x|$ that the result is
   not true when $0 \in \partial f'$. The result can be generalized to include the case $0 \in \partial f'$
   by writing $\|s\|_2 \le 1$ in place of $\|s\|_2 = 1$.


8. Consider the Freudenstein and Roth equations

   \[
   \begin{aligned}
   c_1(x) &= x_1 - x_2^3 + 5x_2^2 - 2x_2 - 13 \\
   c_2(x) &= x_1 + x_2^3 + x_2^2 - 14x_2 - 29
   \end{aligned}
   \]

   (see Chapter 6, Volume 1) and consider minimizing $\|c(x)\|_p$ for $1 \le p \le \infty$.
   A local solution in all cases is $x^* = (11.4128, -0.8968)^T$. Find the sets
   $\partial\|c^*\|_p$ (that is $\partial h^*$) for $p = 1, 2, \infty$ and the vector $\lambda^*$ which satisfies theorem
   14.2.1. For $p = 1$ the local solution is not unique and any $x_1 \in [6.4638,
   11.4128]$ gives a solution. Find $\partial\|c^*\|_1$ when $x_1^* = 6.4638$.

9. Consider minimizing $\|c\|_1$ when the equation $c_1(x)$ in Question 14.8 is scaled
   by multiplying the right hand side by 2. Show that $x^* = (6.4638, -0.8968)^T$
   satisfies the second order sufficient conditions of theorem 14.2.3 and find $\lambda^*$.
   Find a trajectory which is feasible in (14.2.19) (see Figure 14.2.3) and hence
   find the sets $\mathcal{G}^*$ and $G^*$ and verify that they are non-empty and equal. Verify
   that the vectors $a_i^*$, $i \in Z$, are linearly independent which implies $\mathcal{G}^* = G^*$
   (see (14.2.27) ff.).

10. Consider minimizing $\|c\|_\infty$ for the system which results on adding an equation
    $c_3(x) = \alpha x_1$ to those in Question 14.8. If $\alpha = 0.4336$ show that $x^* = (11.4128,
    -0.8968)^T$ satisfies the second order sufficient conditions of theorem 14.2.3
    and find $\lambda^*$. Find a trajectory which is feasible in (14.2.19) and hence find the
    sets $\mathcal{G}^*$ and $G^*$ and verify that they are non-empty and equal. Verify that
    (14.2.25) is satisfied which implies $\mathcal{G}^* = G^*$.

11. Show that the vectors in (14.2.25) are independent iff the vectors


* * 
aig eA 
( 8 1 ‘) are independent. To do this let the vectors be the columns of 


matrices B and ie | respectively, where 


rn PB 0 gt A*h, T 
eu [oe fo |e perae 
ore C 
Clearly if me has full rank then so has B. If a has not full rank then there 


IT 
exists U= is ) # 0 such that or u = 0. Show that this implies that B does 
n+] 


not have full rank. 

12. Consider the problem defined in (14.4.2). At $x^\infty$ find the multipliers $\lambda^\infty$ for
    which $g^\infty + A^\infty \lambda^\infty = 0$ and show that $\lambda_2^\infty < 0$ which implies that $\phi(x)$ can be
    reduced by making the inequality $v \ge c_2(x)$ in (14.1.6) inactive (see end of
    Section 14.1). Find the multipliers $\lambda^*$ at $x^*$ and show that the conditions of
    theorem 14.2.1 hold. Apply the method of (14.4.8) from $x^{(1)} = (0, -4)^T$ and
    verify that a second order rate of convergence is obtained. Why does this
    happen in the absence of curvature information?

13. In problem (14.3.5) let $x^{(1)}$ lie on the unit circle arbitrarily close to $x^*$, and let
    $\lambda^{(1)} = \lambda^*$. Show that there exists a range of values of $\nu$ with $\nu < 1/\lambda^*$ for which
    the unit step determined by solving (14.4.11) fails to reduce $\phi(x)$.

14. Consider the unconstrained NDO problem: min $f(x) + \sum_i |r_i(x)|$, where
    $r(x) = A^T x + b$ ($x \in \mathbb{R}^n$) and $f(x)$ is convex. By introducing variables

    \[
    r_i^+ = \max(r_i, 0), \qquad r_i^- = \max(-r_i, 0),
    \]

    show that $r_i^+ - r_i \ge 0$, $r_i^- + r_i \ge 0$, and $r_i^+ + r_i^- = |r_i|$. Hence show that the
    unconstrained problem can be restated as

    \[
    \begin{aligned}
    &\underset{x,\,r^+,\,r^-}{\text{minimize}} && f(x) + e^T (r^+ + r^-) \\
    &\text{subject to} && r^+ - A^T x - b \ge 0, \\
    & && r^- + A^T x + b \ge 0, \quad r^+ \ge 0, \quad r^- \ge 0,
    \end{aligned}
    \]

    where $e = (1, 1, \ldots, 1)^T$. Show that this is a convex programming problem.
    Write down the dual of this problem, denoting the multipliers of the constraints
    by $\lambda^+$, $\lambda^-$, $\mu^+$, $\mu^-$ respectively. By eliminating $\mu^+$ and $\mu^-$, and writing
    $\lambda^+ - \lambda^- = \lambda$, show that the dual can be restated more simply as

    \[
    \begin{aligned}
    &\underset{x,\,\lambda}{\text{maximize}} && f(x) + \lambda^T (A^T x + b) \\
    &\text{subject to} && g(x) + A\lambda = 0, \quad -e \le \lambda \le e,
    \end{aligned}
    \]

    where $g = \nabla_x f$ (see also Watson, 1978).


Note added in proof: some recent suggestions for averting the Maratos effect are
given in the paper 'Second order corrections for nondifferentiable optimization'
by R. Fletcher in Numerical Analysis, Dundee 1981 (Ed. G. A. Watson),
Springer-Verlag, Berlin (to be published).